In Memoriam




On November 23, 1995, Hewlett-Packard lost a close friend and a consistent source of 
inspiration with the death of Barney Oliver. Barney was a towering figure in HP's his- 
tory, but he was especially close to HP Laboratories, the organization he founded and 
directed for almost 25 years. Some of his technical contributions were described in 
more than a dozen HP Journal articles. But his technical achievements were just part
of the legacy he leaves. No one who ever worked with or knew Barney can forget his 
insistence on excellence and innovation. He was a great intellect and a master inven-
tor whose interests and expertise spanned many scientific disciplines. He will be
deeply missed. 
In this Issue



Symmetric multiprocessing means that a system can distribute its workload 
evenly over multiple CPUs. Thus, coupled with high-performance memory and 
I/O subsystems, a symmetric multiprocessing system is able to provide balanced 
high performance. The article on page 8 begins a group of articles that describe 
a new family of symmetric multiprocessing workstations and servers that have 
both low cost and high performance and satisfy a wide range of customer 
needs. First, there are the HP 9000 J-class high-end workstations and the HP 
9000 K-class servers, which run the HP-UX* operating system, and then there 
are the HP 3000 Series 9x9KS servers, which run the MPE/iX operating system. 
The J-class workstation provides up to two-way symmetric multiprocessing, and 
the K-class server provides up to four-way symmetric multiprocessing. These systems are based on the 
superscalar PA-RISC processor called the PA 7200 (page 25) and a high-speed memory bus called the 
Runway bus (page 18). 

The Runway bus, which is the backbone of these J/K-class platforms, is a new processor-to-memory-
and-I/O interconnect that is ideally suited for one-way to four-way symmetric multiprocessing for
high-end workstations and midrange servers. The bus includes a synchronous, 64-bit, split-transaction, 
time multiplexed address and data bus that is capable of sustained memory bandwidths of up to 768 
Mbytes per second in a four-way system. 

One of the design goals for the Runway bus was to support the PA 7200 and future processors. The PA 
7200 is an evolution of the high-performance, single-chip superscalar PA 7100 CPU design. The processor 
and the Runway bus are designed for a bus frequency of 120 MHz in a four-way multiprocessor system, 
which enables the sustained memory bandwidth of 768 Mbytes per second mentioned above. The PA 
7200 contains all the circuits required for one processor in a multiprocessor system except for external 
cache arrays. Among some of the features contained in the PA 7200 are a new data cache organization, 
a prefetching mechanism, and two integer ALUs for general integer superscalar execution. The PA 7200 
is described on page 25.

The increased functionality and higher operating frequency of today's VLSI chips have created a corre- 
sponding increase in the complexity of the verification process. In fact, chip verification activities now 
consume more time and resources than design. The article on page 34 describes the functional and 
electrical verification process used for the PA 7200 processor to ensure its quality and correctness. 
Since the design of the PA 7200 was based on the PA 7100 processor, verification could begin very early 
in the design because the same modeling language and simulator used for the PA 7100 could be used for 
the PA 7200. The article also describes debugging activities performed and the testability features pro- 
vided on the PA 7200. 

After investigating ways to improve customer application performance by observing existing platforms, 
the HP 9000 J/K-class design team determined that memory capacity, memory bandwidth, memory 
latency, and system-level parallelism (multiple CPUs and I/O devices all accessing memory in parallel) 
were key elements in achieving high performance. As the article on page 44 describes, a major improve- 
ment in memory bandwidth was achieved through system-level parallelism and memory interleaving, 
which were designed into the Runway bus and the J/K-class memory subsystem. 

Cache coherency refers to the consistency of data between processors (and associated caches), memory 
modules, and I/O devices. For the HP 9000 J/K-class systems, a scheme called hardware cache coherent 
I/O was introduced. This technique involves the I/O system hardware in ensuring cache coherency, 
thereby reducing memory and processor overhead and contributing to greater system performance. 
Cache coherent I/O is discussed in the article on page 52. 

The articles on pages 60 and 68 are more papers from the proceedings of HP's 1995 Design Technology 
Conference (DTC). The first article describes a 1.0625-Gbit/s Fibre Channel transmitter and receiver chipset.
About three years ago HP introduced the first commercially available, two-chip, 1.4-gigabit-per-second, 
low-cost, serial data link interface, the HP HDMP G-link chipset. The new chipset, the HP HDMP-1512 
(transmitter) and the HP HDMP-1514 (receiver), is a low-cost gigabit solution for Fibre Channel applica-
tions. The chipset implements the Fibre Channel FC-0 physical layer specification at 1.0625 Gbits/s. The
transmitter features 20:1 data multiplexing with a comma character generator and a clock synthesis 
phase-locked loop, and includes a laser driver and a fault monitor for safety. The receiver performs 
clock recovery, 1:20 data demultiplexing, comma character detection, word alignment, and redundant 
loss-of-signal alarms for eye safety. 

The other DTC paper (page 68) discusses using the traditional software code inspection process for 
inspecting hardware descriptions written in Verilog HDL (hardware description language). The code 
inspection process for software development has been around for a while and has proven itself to be an
effective tool for finding design and code defects and sharing best practices among software engineers. 
The authors found that except for some issues specific to HDL, the format and results of their inspection 
process were very similar to the standard software inspection process. 

The Telecommunications Industry Association (TIA) has released two standards (IS-95 and IS-97) that 
specify the various measurements required to ensure the compatibility of North American CDMA (code 
division multiple access) cellular transmitters and receivers. CDMA, which is used by the cellular tele- 
phone industry, is a class of modulation that uses specialized codes to provide multiple communication 
channels in a designated segment of the electromagnetic spectrum. The article on page 73 provides a 
tutorial overview of the operation of the algorithms in the HP 83203B CDMA cellular adapter, which is 
designed to make the base station measurements specified in the TIA standards. The article also covers 
the general concepts of CDMA signals and measurement and some typical measurements made with 
the HP 83203B. 

C.L. Leath
Managing Editor 



Cover 

The HP 9000 J/K-class servers and workstations and the HP 3000 Series 9x9KS servers are system-de- 
signed for high performance and low cost, meaning that all of the boards have design features that opti- 
mize their functionality specifically for these systems. The cover is a group photograph of the boards. 
Individual descriptions and photos can be found in the article on page 8. 



What's Ahead 

Articles planned for the April issue include eight articles on the Common Desktop Environment (CDE) for 
systems based on the UNIX® operating system and articles on the PalmVue mobile patient data system,
the HP G1009A protein analyzer, and a power module for cellular telephones. 
Symmetric Multiprocessing
Workstations and Servers 
System-Designed for High 
Performance and Low Cost 

A new family of workstations and servers provides enhanced system 
performance in several price classes. The HP 9000 Series 700 J-class 
workstations provide up to 2-way symmetric multiprocessing, while the 
HP 9000 Series 800 K-class servers (technical servers, file servers) and 
HP 3000 Series 9x9KS business-oriented systems provide up to 4-way 
symmetric multiprocessing. 

by Matt J. Marline, Brendan A. Voge, Loren P. Staley, and Badir M. Monsa 



Blending high performance and low cost, a new family of
workstations and servers has been designed to help main- 
tain HP's leadership in system performance, price/perfor- 
mance, system support, and system reliability. This article 
and the accompanying articles in this issue describe the 
design and implementation of the HP 9000 J-class work-
stations, which are high-end workstations running the
HP-UX* operating system, the HP 9000 K-class servers,
which are a family of midrange technical and business serv-
ers running the HP-UX operating system, and the HP 3000
Series 9x9KS servers, which are a family of midrange busi- 
ness servers running the MPE/iX operating system. In this 
issue, these systems will be referred to collectively as J/K- 
class systems. 

The goals of the design team were to achieve high per-
formance and low cost, while at the same time creating a
broad family of systems that would share many of the same 
components and meet a wide range of customer needs. The 
challenge was to create a list of requirements that would 
meet the needs of the three different target markets: the 
UNIX-system-based workstation market, the UNIX-system-
based server market, and Hewlett-Packard's proprietary
MPE/iX-system-based server market. The basic require-
ments for these systems were to deliver leadership symmet-
ric multiprocessing performance, memory performance, and
capacity, along with exceptional I/O performance. Balanced 
system performance was the overall goal. 

Hardware Features 

All of the J/K-class platforms are built around the same basic
building blocks (see Fig. 1). The backbone of these systems
is the high-speed processor-memory bus called the Runway
bus. This is a 640-to-768-Mbyte/s (peak sustained bandwidth),
64-bit-wide bus that connects the processors, system main
memory, and the I/O adapter (bus converter). The Runway
bus is described in more detail in the article on page 18. The
I/O adapter provides connections to two HP-HSC (Hewlett-
Packard high-speed system connect) buses, providing a raw
I/O bandwidth between 128 Mbytes/s and 160 Mbytes/s
(95 to 116 Mbytes/s peak sustained bandwidth). The
HP-HSC bus is an extension of the GSC (General System 
Connect) bus used in earlier workstations. 1 
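
The relationship between the raw and peak sustained figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not anything taken from the design itself. It assumes that the 640-Mbyte/s Runway figure corresponds to the 100-MHz bus and the 768-Mbyte/s figure to the 120-MHz bus, that the HP-HSC bus carries 32 bits of data per clock, and that the 95-Mbyte/s and 116-Mbyte/s sustained figures belong to the 32-MHz and 40-MHz HP-HSC clocks respectively. The function and table names are illustrative only.

```python
def raw_mbytes_per_s(width_bits, clock_mhz):
    """Bandwidth in Mbytes/s if data moved on every clock cycle."""
    return width_bits / 8 * clock_mhz


# (bus, width in bits, clock in MHz, peak sustained Mbytes/s quoted in the text)
# The pairing of clock rates with sustained figures is an assumption.
BUSES = [
    ("Runway", 64, 100, 640),
    ("Runway", 64, 120, 768),
    ("HP-HSC", 32, 32, 95),
    ("HP-HSC", 32, 40, 116),
]


def efficiency_table():
    """Return (bus, clock, raw, sustained, sustained/raw) for each entry."""
    rows = []
    for name, width, mhz, sustained in BUSES:
        raw = raw_mbytes_per_s(width, mhz)
        rows.append((name, mhz, raw, sustained, sustained / raw))
    return rows


if __name__ == "__main__":
    for name, mhz, raw, sustained, eff in efficiency_table():
        print(f"{name} at {mhz} MHz: raw {raw:.0f}, "
              f"sustained {sustained} Mbytes/s ({eff:.0%} of raw)")
```

Under these assumptions the raw HP-HSC figures come out at 128 and 160 Mbytes/s, matching the text, and both Runway entries work out to 80 percent of the raw rate.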

In addition to the Runway and HP-HSC buses, the J/K-class 
systems also support a connectivity I/O bus. In the HP 9000
J-class workstation systems, the connectivity I/O bus is
EISA (Extended Industry Standard Architecture); it has a
peak bandwidth of 32 Mbytes/s. In the HP 9000 K-class and
HP 3000 Series 9x9KS server systems, the connectivity I/O
bus is the HP Precision Bus (HP-PB). The servers have one
or two four-slot HP-PB adapters. Each HP-PB has a peak 
bandwidth of 32 Mbytes/s. 

Processor 

The core of the J/K-class systems is a high-performance pro-
cessor module that interfaces directly to the Runway bus. It
is based on the HP PA 7200 CPU chip, 2 a PA-RISC super- 
scalar processor, which is an evolution of the high-perfor- 
mance, single-chip, superscalar PA 7100 processor. The PA 
7200 incorporates a high-speed Runway bus interface, a new 
data cache organization with an on-chip assist cache, data 
prefetching, and two integer ALUs. This microprocessor is 
fabricated using HP's 0.55-micrometer CMOS process and 
delivers reliable performance up to 120 MHz. More informa- 
tion on the PA 7200 can be found in the article on page 25. 
Fig. 2 is a photograph of the processor module.

The tables on page 10 indicate the processor speeds for 
each of the platforms in the J/K-class family. Table I is for 
the HP-UX workstation systems, Table II is for the HP-UX
symmetric multiprocessing servers, and Table III is for the
MPE/iX symmetric multiprocessing servers.

System Board 

Central to the J/K-class systems is the system circuit board
(Fig. 3). This printed circuit board contains all the circuitry
required for implementing the Runway bus and connectors
for the processors, the master memory controller, and the
I/O adapter. The bootstrap code and other system-specific
hardware are also on the system board. For the entire family
of J/K-class systems, there are only three system board de-
signs: one for the 1-way or 2-way symmetric multiprocessing
workstation configuration (J class), one for the uniproces-
sor server configuration (K100), and one for the 1-way to
4-way symmetric multiprocessing server systems (K2x0,
K4x0).

Fig. 2. Processor module.

Table I
HP 9000 J-Class Processor Speeds

Model     Processor Slots     Processor Speed
J200      2                   100 MHz
J210      2                   120 MHz

Table II
HP 9000 K-Class Processor Speeds

Model     Processor Slots     Processor Speed
K100      1                   100 MHz
K200      4                   100 MHz
K210      4                   120 MHz
K400      4                   100 MHz
K410      4                   120 MHz

Table III
HP 3000 Series 9x9KS Processor Speeds

Series    Processor Slots     Processor Speed
939KS     1                   80 MHz†
959KS     4                   100 MHz
969KS     4                   120 MHz

† Effective processor speed.

In the workstation systems, the system board includes the
Runway bus and system dependent hardware mentioned
above, the complete memory system including the connec-
tors for the memory modules (SIMMs), most of the circuitry
required for the system's built-in I/O functionality (core I/O),
and power supply management and control circuits. Five I/O
slots are provided for system I/O expansion. These five slots
are shared: a combination of EISA and HP-HSC cards can be
installed, with a maximum of four EISA cards or three HP-
HSC cards. For example, a system could have four EISA
cards and one HP-HSC card, or three EISA and two HP-HSC
cards, or two EISA and three HP-HSC cards.
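
The shared-slot rule for the workstation can be expressed as a simple check. This is a minimal sketch that uses only the three limits stated above (five slots in total, at most four EISA cards, at most three HP-HSC cards). The function names are hypothetical, and the sketch ignores which physical slot accepts which card type.

```python
MAX_SLOTS = 5   # shared I/O expansion slots on the workstation system board
MAX_EISA = 4    # maximum number of EISA cards
MAX_HSC = 3     # maximum number of HP-HSC cards


def is_valid_configuration(eisa_cards, hsc_cards):
    """Return True if the card mix respects all three stated limits."""
    if eisa_cards < 0 or hsc_cards < 0:
        return False
    return (eisa_cards <= MAX_EISA
            and hsc_cards <= MAX_HSC
            and eisa_cards + hsc_cards <= MAX_SLOTS)


def full_configurations():
    """List every (EISA, HP-HSC) mix that fills all five slots."""
    return [(eisa, MAX_SLOTS - eisa)
            for eisa in range(MAX_SLOTS + 1)
            if is_valid_configuration(eisa, MAX_SLOTS - eisa)]


if __name__ == "__main__":
    print(full_configurations())
```

Enumerating the fully populated cases yields exactly the three mixes given as examples in the text: two EISA and three HP-HSC, three and two, and four and one.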

The symmetric multiprocessing server system board in-
cludes the Runway bus and system dependent hardware as
described above, along with slots for a separate core I/O
card, an optional expansion HP-HSC I/O carrier card, one or
two 1G-byte memory carriers, and four or eight HP-PB slots.
Four Runway slots are provided for the processor modules. 
Depending on the processor used in the system, the Runway 
bus operates at 100 MHz or 120 MHz. 

The uniprocessor system board (HP 9000 Model K100) has a
single processor, all memory controllers and SIMM slots, a 
core I/O card slot, and four HP-PB slots. 

System Firmware 

All of the J/K-class systems share a common firmware base
that tests and initializes the system on power-up. This code 
is a combination of PA-RISC assembly code and C. It was a 
design goal to support all of the server products using the 
same firmware and to have a common firmware base for the 
technical workstation products. The code was designed in a 
very modular fashion so that the code base could be easily
ported to the various system platforms.

The system firmware is designed to be very robust. For ex- 
ample, during memory configuration and test, it uses a com- 
bination of bank and page deallocation to deconfigure mem- 
ory containing hard errors, allowing the user to continue 
using the system until the failing memory can be replaced.
Similarly, processors that fail self-test are deconfigured and
the system boot process is continued.
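
The idea of combining page and bank deallocation can be illustrated with a short sketch. This is not the J/K-class firmware, which is written in PA-RISC assembly and C and whose policy is not described here. The page size, bank size, threshold, and function name below are all hypothetical, and banks are treated as contiguous address ranges for simplicity even though the real memory system interleaves them.

```python
from collections import Counter

PAGE_SIZE = 4096            # assumed page size in bytes
BANK_SIZE = 16 * 2**20      # assumed bank size in bytes (contiguous, for simplicity)
BANK_THRESHOLD = 4          # hypothetical: bad pages per bank before the bank is dropped


def plan_deallocation(bad_addresses):
    """Decide which banks and which individual pages to deconfigure.

    Returns (banks, pages): banks with too many failing pages are removed
    whole; failing pages in the remaining banks are removed one by one.
    """
    bad_pages = {addr // PAGE_SIZE for addr in bad_addresses}
    pages_per_bank = Counter(page * PAGE_SIZE // BANK_SIZE for page in bad_pages)
    bad_banks = {bank for bank, count in pages_per_bank.items()
                 if count >= BANK_THRESHOLD}
    remaining_pages = sorted(page for page in bad_pages
                             if page * PAGE_SIZE // BANK_SIZE not in bad_banks)
    return sorted(bad_banks), remaining_pages


if __name__ == "__main__":
    isolated = [0x1000, 0x1004]                             # two errors in one page
    clustered = [BANK_SIZE + i * PAGE_SIZE for i in range(4)]  # four bad pages in bank 1
    print(plan_deallocation(isolated + clustered))
```

An isolated hard error costs only one page, while a cluster of failures removes the whole bank, so the system can still boot with most of its memory available.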

In addition to providing a robust system to the customer, the 
system firmware allows designers and the manufacturing 
processes easy access to system test and configuration of 
hardware and firmware features. Some of these features 
allow enabling or disabling of processor cache prefetching, 
full memory test or memory initialization only, and so on. 
This helped in the system debug effort by speeding the boot 
process and making it possible to disable certain functions 
while searching for the root cause of a bug in the system. 

Another feature built into the system firmware during the
system debug process was a debug interface that would
allow the lab engineers to set soft breakpoints and step
through instruction execution one instruction at a time. This 
tool proved to be quite valuable, providing increased visibility 
into system behavior and the system state. 

The J/K-class firmware is installed in flash EPROM. The 
firmware can be updated through the system offline diag- 
nostic environment. If for any reason the system firmware 
needs to be modified, it can easily be upgraded by loading a 
new firmware image from tape or another medium into 
system memory and then loading it into the firmware flash 
EPROM. 

The result of these design choices is system firmware that 
provides flexible functionality, reliable system test and ini- 
tialization, and some tolerance for certain types of failed 
components in the system boot process. 

High-Performance Memory 

Memory performance was highly important throughout the
J/K-class system design and implementation. The J/K-class 
memory subsystem is designed with consideration for high 
bandwidth, low latency, and expandability from 32M bytes 
to 2G bytes. It is capable of interleaving memory accesses
across 32 banks of memory. The memory system is built 
around the master memory controller (MMC), which inter-
faces to the high-speed Runway bus. The MMC communi-
cates with up to eight slave memory controllers (SMCs) on
one or two memory carriers (see Fig. 4 and the article on 
page 44). Also on the memory carriers are data multiplexers 
and pairs of SIMMs (single-inline memory modules). This
design results in a high-bandwidth, interleaved, 2G-byte
memory subsystem. As 64M-bit DRAMs become cost- 
effective, the 2G-byte limit will increase to 3.75G bytes of 
main memory. Table IV shows the maximum amounts of 
memory available in various J/K-class systems. 



Table IV
Maximum Memory

System Type     Model or Series     Maximum Memory
HP 9000         Model J2x0          1024M bytes
HP 9000         Model K100          512M bytes
                Model K2x0          1024M bytes
                Model K4x0          2048M bytes
HP 3000         Series 939KS        1856M bytes
                Series 959KS        2048M bytes
                Series 969KS        2048M bytes
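
The benefit of interleaving across 32 banks can be seen with a simple address-to-bank mapping. The sketch below uses the most basic modulo scheme and an assumed transfer size of 32 bytes per access. Neither is stated in this article, and the actual mapping used by the memory controllers may differ (see the article on page 44). The names are illustrative.

```python
LINE_BYTES = 32   # assumed size of one memory access in bytes
NUM_BANKS = 32    # number of banks the memory system can interleave across


def bank_for_address(address):
    """Map a byte address to a bank by rotating through banks line by line."""
    return (address // LINE_BYTES) % NUM_BANKS


def banks_touched(start_address, num_lines):
    """Return the sequence of banks hit by consecutive line-sized accesses."""
    return [bank_for_address(start_address + i * LINE_BYTES)
            for i in range(num_lines)]


if __name__ == "__main__":
    # 34 consecutive accesses cycle through all 32 banks before any repeats
    print(banks_touched(0, 34))
```

With this mapping a sequential stream touches every bank once before returning to the first, so each bank has time to recover while the others are being accessed.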



Supporting a high-density and high-performance memory
system with industry-standard memory SIMMs would have 
resulted in a costly memory system that would not have 
performed at the desired levels. Instead of the industry- 
standard approach, a denser memory module was designed. 
These memory modules (Fig. 5) are actually a dual-inline 
design, although they are still referred to as SIMMs. In the
J/K-class systems, these dual SIMMs are inserted in pairs,
providing two separately addressable, 128-bit, ECC (error
correcting code) protected banks of memory (144 bits
including ECC check bits). Each dual SIMM provides 72 bits
of each of the two 144-bit banks. Using 4M-bit or 16M-bit DRAMs,
the SIMMs are available in 16M-byte and 64M-byte sizes.
While these memory modules are not standard, there is no
HP proprietary technology in them, helping to keep memory
pricing very competitive with the industry. 
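
The bank widths and capacity limits quoted for the memory system fit together arithmetically, as the following sketch shows. It uses only figures from the text (128 data bits and 144 total bits per bank, two banks per SIMM pair, 16M-byte and 64M-byte SIMMs, 32 banks). The count of 16 SIMM pairs in a full system is derived from those figures rather than stated, and the function names are illustrative.

```python
DATA_BITS_PER_BANK = 128    # data width of one bank
TOTAL_BITS_PER_BANK = 144   # data plus ECC check bits
BANKS_PER_PAIR = 2          # each SIMM pair provides two banks
SIMMS_PER_PAIR = 2


def check_bits_per_bank():
    """ECC check bits carried alongside each 128-bit data word."""
    return TOTAL_BITS_PER_BANK - DATA_BITS_PER_BANK


def bits_per_simm_per_bank():
    """Each SIMM of a pair supplies half of every bank's width."""
    return TOTAL_BITS_PER_BANK // SIMMS_PER_PAIR


def capacity_mbytes(simm_mbytes, pairs):
    """Total memory in Mbytes for a given SIMM size and number of pairs."""
    return simm_mbytes * SIMMS_PER_PAIR * pairs


if __name__ == "__main__":
    print(check_bits_per_bank(), bits_per_simm_per_bank())
    print(capacity_mbytes(16, 1))       # smallest configuration
    print(capacity_mbytes(64, 16))      # 16 pairs is a derived, not stated, count
    print(16 * BANKS_PER_PAIR)          # banks available with 16 pairs
```

One pair of 16M-byte SIMMs gives the 32M-byte minimum, and 16 pairs of 64M-byte SIMMs give both the 2G-byte maximum and the 32 banks used for interleaving.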

I/O Adapter 

The J/K-class I/O adapter (bus converter) interfaces between
the Runway bus and the HP-HSC I/O bus. The I/O require-
ments for a J/K-class system call for multiple I/O buses, so
the I/O adapter package contains two fully independent bus
converters (see Fig. 6a). To maximize system flexibility, the
I/O adapter is designed to support a range of bus frequen-
cies on either bus, thus requiring a full synchronizer. Fig. 6b
is a block diagram of the I/O adapter.

Fig. 4. Memory carrier board.

Fig. 5. Dual-inline memory module.

The HP-HSC bus only has a 32-bit address space, while the 
Runway bus supports a 40-bit address space. This requires 
an address translation mechanism to provide the additional 
eight address bits. The processor's aggressive data prefetch- 
ing requires a new mechanism for DMA (direct memory 
access) to coexist with this processor feature. Hardware 
cache coherent I/O solves these two problems (see article, 
page 52). Prefetching is also included in the HP-HSC bus to 
reduce memory read latency and increase I/O bandwidth. All 
of these features required additional hardware support in 
the I/O adapter.
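
The need for address translation between a 32-bit I/O bus and a 40-bit memory bus can be illustrated with a page-based lookup. This is a generic sketch of the concept, not the I/O adapter's actual mechanism, which is covered in the article on page 52. The 4K-byte page size, the dictionary used as a translation table, and the function name are all assumptions made for illustration.

```python
PAGE_BITS = 12          # assumed 4K-byte pages
IO_ADDR_BITS = 32       # HP-HSC address space
PHYS_ADDR_BITS = 40     # Runway address space


def translate(io_address, page_table):
    """Translate a 32-bit I/O address to a 40-bit physical address.

    page_table maps an I/O page number to a physical page number.
    Raises KeyError for an unmapped page and ValueError for an
    out-of-range address.
    """
    if not 0 <= io_address < (1 << IO_ADDR_BITS):
        raise ValueError("I/O address does not fit in 32 bits")
    io_page = io_address >> PAGE_BITS
    offset = io_address & ((1 << PAGE_BITS) - 1)
    physical = (page_table[io_page] << PAGE_BITS) | offset
    if physical >= (1 << PHYS_ADDR_BITS):
        raise ValueError("physical address does not fit in 40 bits")
    return physical


if __name__ == "__main__":
    table = {0x00012: 0xABCDE12}          # hypothetical single mapping
    print(hex(translate(0x00012345, table)))
```

The page offset passes through unchanged, and the table entry supplies the wider physical page number, which is where the additional eight address bits come from.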

According to the PA-RISC architecture definition, a bus con- 
verter also needs to provide the registers to configure address 
space, enable and disable features, log errors, manipulate 
the TLB, and provide diagnostic access. Therefore, these
registers are included in the I/O adapter. 

The J/K-class systems required several other hardware fea-
tures that by default were put into the I/O adapter. Among
these is the hardware to interface to external components 
implementing the processor dependent hardware (PDH) 
necessary to provide boot firmware, stable storage for 
system configuration information and error logging, and
scratch RAM. The I/O adapter also provides a real-time 
clock for keeping track of time when power is off. 

Basic VLSI support of scan-based testing, both internal and 
boundary (JTAG or IEEE 1149.1), is built into the I/O adapter,
along with double-strobe capability for speed path testing 
and built-in self-test (BIST) for the RAM structures. 

Finally, it was desired that the design be done in a modular 
fashion, enabling future designs to easily borrow portions of 
the design for future enhancements or to lower costs. This 
required that the chip be designed with well-defined and 
simple interfaces. The synchronizers made very natural 
places to define the boundaries of these modules. All of 
these requirements led to a modular, synchronizer queue-
coupled, hardware cache coherent, dual bus converter
design. 

Core I/O Functionality 

The basic I/O requirements for both the workstation and the
server systems include 20-Mbyte/s fast-wide SCSI (Small
Computer System Interface) for system disk connectivity,
5-Mbyte/s single-ended SCSI for archival storage, and an
IEEE 802.3 LAN interface for networking. The HP-UX
systems also include a Bitronics parallel interface port,
keyboard and mouse connections, and serial I/O ports as
part of the core I/O functionality. A photograph of the K-
class core I/O board is shown in Fig. 7. The workstation
model adds high-quality audio input and output to the
built-in core I/O.

Fig. 6. (a) There are two fully independent I/O adapters in the I/O
adapter package. (b) I/O adapter block diagram.

The server system includes additional serial ports and an 
integrated modem for remote service access. The server 
systems also have a remote service console access port to
allow remote servicing of hardware and software by
Hewlett-Packard's customer support organizations. 



I/O Expansion 

On the HP 9000 K-class server systems, several configura-
tions support various system I/O needs (see Table V). As a
minimum, the system comes with one 32-MHz HP-HSC bus
slot for expansion I/O. This slot is in a compact 3-by-5-inch
form factor. As I/O needs increase, the system can be up-
graded to provide four 40-MHz HP-HSC slots in addition to
the one 32-MHz HP-HSC slot (Model K4x0 only). In addition
to the HP-HSC slots, the K-class server has four or eight
Hewlett-Packard Precision Bus (HP-PB) slots. These slots
are configured such that the user can install up to four
double-high HP-PB cards and still have four single-high
HP-PB card slots available.

Table V
HP 9000 K-Class I/O Expansion Capabilities

Model     HP-HSC Bus Slots     HP-PB Slots     Peak Sustained I/O Bandwidth
K100      1                    4               95 Mbytes/s
K200      1                    4               211 Mbytes/s†
K210      1                    4               211 Mbytes/s†
K400      5                    8               211 Mbytes/s†
K410      5                    8               211 Mbytes/s†

† Combined bandwidth of the two HP-HSC buses.

In the HP 9000 J-class workstation configurations, the system
supports an 8-MHz EISA bus (maximum of four slots) and a
40-MHz HP-HSC expansion I/O bus (maximum of three slots).
These slots provide the workstation user with a great deal of
flexibility in configuring I/O devices and meeting high-speed
I/O requirements (see Table VI).

Table VI
HP 9000 Model J2x0 I/O Expansion Capabilities

I/O Slot     Configuration
Slot 0       HP-HSC
Slot 1       HP-HSC or EISA
Slot 2       HP-HSC or EISA
Slot 3       EISA
Slot 4       EISA

Fig. 7. K-class core I/O board.

A number of I/O cards are currently available for use in the
high-speed I/O and HP-PB bus slots. A number of EISA cards
are supported in the workstation system. The following is a
partial list of the I/O cards available for J/K-class systems:

5-Mbyte/s single-ended SCSI
20-Mbyte/s fast-wide SCSI
HP Fiber Link
IEEE 802.3 LAN
IEEE 802.5 token ring
FDDI
FibreChannel
ATM (asynchronous transfer mode)
Programmable serial interface
Bitronics parallel port
16-port serial RS-232
32-port serial RS-232
2D graphics card
3D graphics card.

Integrated Peripherals 

The server systems all have a standard DDS tape drive and a 
CD-ROM drive integrated into the system. In addition, there 
is space available for up to four 20-Mbyte/s SCSI disk drives 
in the system box. The workstation comes with a standard 
3.5-inch flexible disk drive, a CD-ROM drive, a tape drive, 
and two slots for 20-Mbyte/s SCSI disk drives built into the
system box. 

Industrial Design 

The J/K-class industrial design is intended to convey a strong 
perception of the power within, wrapped in bold, distinctive 
designs that are elegant and pleasing to the eye. The K-class
product is designed to work as a floor-standing product as
well as rack-mounted in an industry-standard HP 19-inch
EIA rack (Fig. 8). A growing number of rack-mounted HP
peripheral products such as disk arrays, uninterruptible 
power supplies, and LAN hubs complement the overall sys- 
tem. The J-class system (Fig. 9) is designed for floor-standing 
use in the commercial workstation environment, but can be 
rack-mounted on a custom basis. 

These machines were designed with ease of assembly and 
serviceability as high priorities. They use plastic parts that
snap together over a riveted steel chassis without a single 
screw or fastener, making assembly and disassembly very 
quick and easy for service and for the eventual recycling at 
the end of the products' life. 

Customer ease of use was another design priority. This is 
evident in the brightly backlit liquid crystal display, which 
conveys system status information in a clear text font, a vast 
improvement over previous systems, which had flashing 
LEDs. A simple three-position keyswitch for on, off, and 
service mode is clearly marked and positioned within easy
reach on the front of the K-class system. The front door 
gives the user easy access to peripherals and visual feedback 
in the form of disk activity lights. Inside the front door are a 
pocket for the user manual, a safe storage location for the 
system key, and a system label with the most pertinent user
information. 

Extensive effort went into label design, working with field 
support engineers, to make these products the leaders in
their class in ease of installation, serviceability, and field up-
gradability. The labels use color coding and detailed diagrams
clearly defining such things as board locations, memory 
SIMM loading sequences, and disk locations. These have 
been very successful in making the many configurations 
easily understood by customers and HP manufacturing and 
field service personnel. 

Server Package Design 

Like every other aspect of the design of the J/K-class sys-
tems, designing the chassis and plastics proved to be chal-
lenging. With a strong emphasis on development schedule
and a desire for a very robust and flexible design, the engi- 
neering team had to create some innovative solutions to 
keep on schedule and keep the cost of the product low. 

Several requirements defined the maximum height and
width of the server box. It had to fit into a 19-inch rack, so it
could be no wider than 17.3 inches and no deeper than 25
inches. It could be no taller than 25.24 inches, so that as a
standalone unit it would fit under a standard table.

Fig. 8. K-class server configurations.

Fig. 9. J-class workstation system processing unit.

An additional challenge was that of the Runway bus. The 
expected high speeds of the bus required that the bus length 
be kept to a minimum to reduce signal propagation delays. 
At the same time, up to six different components needed to 
attach to this bus: four processors, the master memory con- 
troller (MMC), and the I/O adapter. The processor module 
spacing was kept to 1.2 inches, allowing the overall length of 
the Runway bus to be short enough to support reliable 
120-MHz operation. 

This small size also presented an additional challenge, that
of cooling the many components in the system. An intensive 
effort was launched to simulate and create mockups of the 
proposed mechanical designs for airflow and expected inter- 
nal system temperature rises. A number of cooling alterna- 
tives were proposed and evaluated. The resulting solution 
provides an air-cooled system with excellent airflow and a 
remarkably quiet system for the power dissipated within the 
box. There are only two fans in the entire system. 

Other desired features helped to define how the system was
partitioned into the various circuit boards. Each board
needed to be easy to access. Almost every component in the
system can be accessed and removed from the system for
maintenance or repair in a matter of minutes. The size and
type of add-on I/O cards also required some creative design
to allow flexibility in the design as well as flexibility for the
customer.



power monitor, and Ihe uninterruptible power supply (TIPS). 
The power features implemented in the workstation and 
server systems are slightly different but follow the same 
general philosophy: provide reliable power to the system at 
a low cost. 

The server system power supply provides the voltage rails 
for the components in the system. This 925-watt supply in- 
corporates power factor correction and accepts a wide 
range of input voltages and input frequencies between 50 
and 60 Hz. It provides a carryover time of 20 milliseconds 
after a power failure. 

The system power supply does not include any intelligence 
to control the system turn-on or reset activities. This intelli- 
gence is provided by the system power monitor (Fig. 10). 
This circuit monitors the various aspects of the computer 
system and the power supply output to determine if the 
power should be turned on or off. This includes monitoring 
the system internal temperature, checking the voltage out- 
put to ensure that there is no nndervoltage or overvoltage 
condition, and providing diagnostic messages on the system's 
liquid crystal display when problems occur. 

The uninterruptible power supply (UPS) is an optional component 
of the J/K-class systems. It provides additional assurance 
of system availability and data integrity, even if the ac 
power lines fail for any reason. Upon a powerfail event, the 
UPS provides ac power to the system for up to 15 minutes, 
allowing the system to continue operation, or in the case of 
an extended power outage, to shut down gracefully and save 
critical data to disk. When ac power returns, the system will 
continue operation, or if it was shut down, it can be restarted 
without loss of data. 

More details on the power supply, monitor, and UPS can be 
found on page 16. 

System Performance 

The J/K-class systems were developed to provide customers 
with excellent performance in the intended markets: mid- 
range servers and high-end workstations. Our goal was not 
necessarily to provide the highest single-component perfor- 
mance, but to provide customer-valued application perfor- 
mance at an extremely attractive price. 




Power System 

The power system for the J/K-class systems can be split into 
three subsystems: the system power supply, the system 
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The most common component benchmark is the SPEC 
(Systems Performance Evaluation Cooperative) suite, which 
measures CPU integer and floating-point performance. For 
these benchmarks, the processor in the J/K-class products 
provides about 168 SPECint92 integer and 258 SPECfp92 
floating-point performance at 120 MHz. With the well- 
balanced symmetric multiprocessing J/K-class systems, the 
SPECrate_int92 performance of a four-processor 120-MHz 
system is 12,150 and the SPECrate_fp92 performance is 
19,600. 

It is in the real-world applications and at the system level 
where the J/K-class computer systems really start to shine. 
The balanced design of the Runway processor-memory bus, 
the memory subsystem, and the performance I/O system 
provides the user with exceptional performance. Two widely 
used benchmarks that try to measure the performance of 
realistic customer workloads are SPEC SFS 1.0 (LADDIS) 
and TPC-C. The 120-MHz LADDIS performance of the J/K-class 
is as high as 4750 I/O operations per second, which 
exceeds that of many high-end servers that typically have twice as 
many processors (8 to 10 processors compared to four for a 
Model K400). A four-way J/K-class server at 120 MHz has 
demonstrated in excess of 3,000 transactions per minute on 
the TPC-C transaction benchmark. At the time of introduc- 
tion of the J/K-class systems, the only other single system 
with higher performance was HP's own T500 corporate 
business server. 

On the technical side, workstation applications clearly bene- 
fit from the increased memory bandwidth. At the same time, 
the introduction of multiprocessing in a high-end client 
configuration provides the opportunity for either parallel 
processing of a single task or more parallel execution for 
multiple tasks. With the addition of the new high-end HP 
Visualize48 graphics, for which the J-class systems provide 
some specific hardware performance enhancers, the workstation 
products will handle large, complex design and 
visualization problems easily. 

Design for Lasting Value 

The J/K-class systems were designed to provide HP's custom- 
ers with lasting value. Processors can be easily added to the 
system, to a maximum of two processors in the workstation 
systems and up to four processors in the server systems. 
Upgrading from 100-MHz to 120-MHz processors is just as 
simple. The J/K-class systems are also designed to accept 
future processors easily, such as the PA 8000 processor, 
through a simple processor module upgrade. 

Not only are processors easy to upgrade, but memory and 
I/O are also designed so that it is easy to add memory and 
I/O functionality. Memory can be added in 32M-byte or 
128M-byte increments up to 1024M bytes in a J-class system 
or up to 2048M bytes in a K-class or Series 9x9KS system. As 
increased-density DRAMs become cost-effective, memory 
limits will increase to 3.75G bytes of main memory, meeting 
most users' memory configuration and capacity requirements 
far into the future. 

System Verification 

The design of any computer system requires an extensive 
test and verification effort. For the chips and boards designed 
for the J/K-class platforms, many engineer-months 



were dedicated to ensuring the systems manufactured and 
shipped to HP's customers are of the highest quality and 
reliability. This testing can be grouped into several different 
categories: presilicon chip and system simulation, formal 
verification methods, system functional verification, chip and 
system electrical characterization, and system validation. 

Simulation. Before committing any part of the J/K-class de- 
sign to silicon, extensive simulation had already proven the 
basic functionality of each component individually and as 
part of the system. Each component design team provided a 
model of their particular part of the design to an overall system 
simulation team. The system simulation team then pulled 
together tools first to simulate subsystems and eventually to 
simulate the entire J/K-class system. In addition to the logical 
simulation to verify correct functionality, electrical simula- 
tion was done for the critical portions of the system such as 
clock distribution, system buses, and chip internal critical 
paths (see article, page 34). 

Formal Methods. For some parts of any design, it is very diffi- 
cult to verify complete adherence to design specifications. 
One area of concern in the J/K-class design was the bus protocols 
for the Runway bus. In an effort to reduce risk and 
improve system reliability, formal methods 4 were used to 
analyze the bus transaction protocols used in the Runway 
bus definition. The analysis pointed to several defects, which 
were corrected before implementation of the system. 

Functional Verification. As the first components became avail- 
able to the design teams for initial debugging, efforts were 
focused on verifying that each component functioned prop- 
erly in the system. The first goal was to boot the system to 
the initial system loader. At this point either the operating 
system (HP-UX) could be loaded or system and component 
diagnostics could be loaded. While booting the operating 
system is a great accomplishment, the task of verifying cor- 
rect functionality was far from completed when this was 
done. 

Numerous tests were developed specifically for the J/K-class 
systems. These tests employed a number of techniques for 
finding defective components and defects in design. These 
techniques included pseudorandom and pseudoexhaustive 
code and data sequences that stressed the processors (inte- 
ger units, floating-point units, caches, program control, etc.), 
the memory and memory controllers, and the I/O bus adapters 
and I/O controller cards. 
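
The pseudorandom data technique can be illustrated with a short Python sketch. This is not the HP test code: the function name, the use of an in-memory sequence as a stand-in for real hardware, and the simulated stuck-bit fault are all assumptions made purely for illustration. The idea is that a seeded generator produces a reproducible data pattern, which is written out and then compared on read-back.

```python
import random


def pseudorandom_memory_test(memory, seed):
    """Write a seeded pseudorandom byte pattern, then verify it.

    Returns the addresses whose read-back value differs from the value
    written. `memory` is any mutable, indexable sequence; it stands in
    for real hardware (an assumption of this sketch).
    """
    rng = random.Random(seed)  # same seed -> same pattern, so failures reproduce
    expected = [rng.randrange(256) for _ in range(len(memory))]
    for addr, value in enumerate(expected):
        memory[addr] = value
    return [addr for addr, value in enumerate(expected) if memory[addr] != value]


class StuckBitMemory:
    """Hypothetical faulty memory: bit 0 of one address is stuck at 1."""

    def __init__(self, size, bad_addr):
        self.cells = [0] * size
        self.bad_addr = bad_addr

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, addr):
        return self.cells[addr]

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value |= 0x01  # the stuck bit corrupts any value with bit 0 clear
        self.cells[addr] = value
```

Running the test with several different seeds raises the chance that the pattern written to a faulty location actually exposes the fault, which is one reason such tests are repeated with many sequences.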

Electrical Characterization. Once a minimal level of system 
functionality was attained, several electrical characteriza- 
tion efforts were launched to prove that the components 
and the system would function in the electrical environment. 
This testing focused on measuring electrical noise on 
chips as well as boards, and looking at bus cross talk and 
power supply variation and noise. Systems were stressed 
beyond normal temperature ranges, voltage ranges, and fre- 
quency ranges to find the weakest link in the system electri- 
cal environment. Through all this characterization effort, 
designs were modified and improved, resulting in a system 
that is capable of running reliably throughout the specified 
system operating environment. 

System Validation. Because of the desire to stress the system 
beyond what HP's customers will do, the functional and 
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K-Class Power System 



The power system in the HP 9000 K-class servers uses a number of new and 
emerging technologies to achieve excellent platform performance without compromising 
cost, reliability, and quality metrics. Combined in the power system are 
the system power monitor, the system power supply, and an optional uninterruptible 
power supply (UPS). Key contributions of the system power monitor include 
system turn-on and initialization including error reporting via a front-panel LCD 
display, temperature monitoring and cooling, fan speed control based on ambient 
temperature, fan synchronization and fault detection, continuous power supply 
output voltage monitoring, special manufacturing modes of operation, overtemperature 
detection and warning, overtemperature shutdown, and other features. 
The system power supply uses power factor correction to achieve low power-line 
distortion while maximizing the available VA capacity of the input ac circuit. A 
standard dc-to-dc forward converter follows the regulated power factor corrected 
output. Remote sensing is used on all output rails to achieve tight regulation 
specifications. The power system is optimized for use with several HP UPSs 
employing both offline and online technologies. The UPSs use an autoranging 
technology allowing worldwide use. Worldwide regulatory and safety approvals 
apply to these UPSs. The hardware provides power-line filtering and conditioning 
while the firmware provides many useful status and control capabilities, both 
real-time and programmed for later execution. 

Fig. 1. System power monitor block diagram. 



System Power Monitor 

The system power monitor (Fig. 1) is where the power system gets its HP personality. 
It was intended that most if not all nonstandard features of the power system 
would be concentrated in this assembly, as opposed to having them in the 
power supply itself. The power monitor is designed around a microprocessor, so 
that most of its features are determined by firmware. This made it convenient to 
modify these features as required during the system development phase, without 
changing hardware. The power monitor is powered by a dedicated +15Vdc supply 
which is turned on at all times if the system has ac power. The functions of the 
power monitor are: 

• Check the CPU modules in the system to see that they are all compatible with 
each other. 

• Check the power supply in the system to see that it is compatible with the CPUs 
present. 

• Respond to system keyswitch position and turn power supply on and off as 
required. 

• Monitor all power supply output voltages for valid range. 

• Monitor ambient temperature and initiate operator warnings. 

• Control fan speed as a function of ambient temperature. 

• Synchronize the two fans to avoid acoustic beating. 

• Check for fan failure. 

• Monitor internal system temperature for valid range. 

• Initiate system reset signals. 

• Issue ac powerfail warning signal. 

• In case of any system malfunction, shut down the system and write a message to 
the front-panel liquid crystal display indicating why the system was shut down. 

The notable contributions of the power monitor are its fan control scheme, which 
makes the system remarkably quiet for its power level, and its contribution to 
system maintainability through diagnostic display messages. 

System Power Supply 

The system power supply is rated for 925W of continuous dc output. Five output 
rails are provided: +3.30Vdc (VDL), +4.4Vdc (VDH), +5.1Vdc (VDD), +12Vdc, and 
-12Vdc. The +3.3Vdc and +5.1Vdc rails are used for standard logic circuits while 
the +4.4Vdc is used exclusively for the CPUs. The +12Vdc is used primarily for disk 
drives and I/O with the remaining -12Vdc rail being used strictly for I/O. All rails 
have ±1.5% regulation windows. Additionally, a +15Vdc, 300-mA rail is provided 
for use by the system power monitor. This rail is electrically isolated from the 
computer rails. Its single point of ground is provided by the power monitor, which 
eliminates the potential for ground loops. The system power supply implementation 
is done entirely in discrete devices with one hybrid, four daughter cards, and 
a 2.8-mm-thick, HP FR4 motherboard. The density is 1.8 watts per cubic inch. Both 
a discrete version and a dc-to-dc module approach were initially investigated, but 
cost, cooling, and reliability concerns ultimately resulted in the discrete version 
being chosen. 
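
To make the ±1.5% regulation window concrete, here is a minimal Python sketch that checks a set of measured rail voltages against their nominal values. The nominal voltages are the five computer rails listed in this section; the dictionary keys, the function name, and the idea of treating the regulation window as the monitor's "valid range" are assumptions of this sketch, not details from the article.

```python
# Nominal rail voltages taken from the rail list in this section.
# Key names are illustrative labels only (assumption).
NOMINAL_RAILS_V = {
    "VDL": 3.30,
    "VDH": 4.4,
    "VDD": 5.1,
    "+12V": 12.0,
    "-12V": -12.0,
}
REGULATION_WINDOW = 0.015  # +/-1.5%, as stated for all rails


def rails_out_of_window(measured, nominal=NOMINAL_RAILS_V,
                        window=REGULATION_WINDOW):
    """Return the names of rails whose measured voltage deviates from
    nominal by more than the regulation window."""
    bad = []
    for name, nominal_v in nominal.items():
        tolerance = abs(nominal_v) * window
        if abs(measured[name] - nominal_v) > tolerance:
            bad.append(name)
    return bad
```

For example, a +5.1Vdc rail has a window of about ±0.0765 V, so a reading of 5.2 V would be flagged while a -12Vdc rail reading -11.9 V (window ±0.18 V) would not.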



electrical characterization efforts mainly focused on test 
software and environments that do not match our custom- 
ers' operating conditions. While it is likely that all hardware 
defects (design related as well as manufacturing related) 
will be found with the methods shown above, it is not 
known if the new hardware might uncover software defects. 
At the same time, it is possible that actual system software 
and applications could uncover hardware defects. For this 
reason, each system is tested under various load conditions 
and system configurations while running actual HP-UX and 
MPE/iX application programs and system exercisers. These 
efforts result in a system that has been designed to operate 
reliably in normal operating conditions as well as under 
extremes of environment. 



Conclusion 

The J/K-class family of workstations and servers takes a big 
step in the direction of converging HP's workstation and 
server lines. At the same time, the J/K-class provides leader- 
ship performance at exceptional value to computer systems 
users. 
The K-class power requirements called for the maximum allowable VA capacity of 
a 100/120Vac 15A branch circuit. To fully utilize the 15A circuit and not require 
customer installation of a 20A branch circuit, it was decided early in the development 
cycle that the system power supply would use power factor correction. This 
means that the input voltage and current waveforms are in phase, so the power 
supply appears to be a resistive load on the ac line. By comparison, traditional 
offline switchers appear to the ac line as peak detectors and thus there are very 
large "spikes" of input current at the peaks of the input voltage waveform. With 
the system power supply appearing resistive, power is drawn out of the ac line 
continuously rather than just at voltage peaks. Without power factor correction, 
typical offline switchers are limited to approximately 600W given 100Vac input 
lines. Power factor correction also allows the supply to operate over a wide range 
of input voltages without requiring any additional circuitry such as autoranging 
circuitry or line select switches. The regulated output voltage of the power factor 
correction circuit is +400Vdc. The supply is rated to operate with input voltages 
from 90Vac to 140Vac for its lower operating range and from 180Vac to 264Vac in 
its higher operating range. The frequency range of operation in either voltage 
range is 50 to 60 Hz. There is a minimum guaranteed carryover time of 20 ms 
before a powerfail warning is issued and an additional 5 ms of carryover time 
after a powerfail is issued. Another benefit of using power factor correction is that 
European norms for line distortion are already met when they become mandated 
in the European Community. 
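
The effect of power factor on what a branch circuit can deliver can be sketched with simple arithmetic. The 100Vac, 15A, and 925W figures come from this sidebar; the supply efficiency of 0.75 and the uncorrected power factor of 0.6 are illustrative assumptions only, chosen to show the shape of the calculation rather than to reproduce HP's design numbers.

```python
BRANCH_VA = 100 * 15  # 100Vac x 15A branch circuit = 1500 VA (from the text)


def required_input_va(dc_output_w, efficiency, power_factor):
    """Apparent power (VA) the ac line must supply.

    Real input power is dc_output_w / efficiency; dividing by the power
    factor converts real power to the apparent power the circuit sees.
    """
    return dc_output_w / (efficiency * power_factor)


# Assumed values for illustration (not from the article):
ASSUMED_EFFICIENCY = 0.75
ASSUMED_UNCORRECTED_PF = 0.6

with_pfc = required_input_va(925, ASSUMED_EFFICIENCY, 1.0)
without_pfc = required_input_va(925, ASSUMED_EFFICIENCY, ASSUMED_UNCORRECTED_PF)
```

Under these assumptions the corrected supply needs roughly 1230 VA, which fits within the 1500 VA circuit, while the uncorrected supply would need more than 2000 VA and would not.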

Two forward dc-to-dc converters are employed in the power supply. Both converters 
take the regulated +400Vdc output of the power factor correction stage and 
convert it to the desired regulated output voltage. With the VDD rail exceeding 
500W it made sense, considering component selection and cost, to have VDD 
generated by one converter and the remaining rails by a second converter, which is 
rated at about 425 watts. The use of two converters also allows sequencing of 
the VDD rail with respect to the VDH rail, which was a semiconductor requirement. 

Two output connectors are required for busing power between the power supply 
and the system board. The footprint of the connectors measures only six square 
inches, so the impact of the power system on the system board layout was 
minimal. 

Uninterruptible Power Supplies 

Unlike many previous HP systems, which used battery backup of only main memory 
during short-duration ac power failures, thus halting any processes in progress, 
the K-class power system uses uninterruptible power supplies (UPSs) for backup. 
This allows uninterrupted operation during an ac line failure for some predetermined 
period of time, after which the computer can be automatically and controllably 
shut down. Should the power be restored before shutdown is required, 
processing will have continued uninterrupted. Should shutdown be required 
because of an extended power loss, then the computer can do a controlled shutdown 
programmatically, after which the UPS can be shut down. This controllable 
turn-off of the UPS and host computer is well-suited for applications in which 
customers want to reduce their energy consumption by shutting down equipment 
programmatically overnight or over the weekend. 

Two UPS technologies are available from HP. The lower-power units all employ 
offline technology: the 600VA (425W) standalone, 1300VA (1300W) standalone, 
1.3-kVA (1300W) rackmount, and 1.8-kVA (1800W) rackmount models. The UPS directs the 
incoming ac directly to the load being supported unless the input falls outside of a 
defined set of voltage and frequency limits. Once this occurs, the UPS 
switches to inverter mode and outputs regulated ac using its internal batteries and 
a dc-to-ac upconverter. This technology is very effective, reliable, cost-competitive, 
and efficient for many applications in which a defined loss of ac input for the load 
can be supported. The time period during which there is no ac input is defined as 
transfer time. The offline units have a transfer time of 10 ms maximum and this 
maps well into HP's computer products, which have a guaranteed carryover time of 
20 ms minimum. The offline topology is very energy efficient; when the ac input is 
within tolerance the UPS is just maintaining its internal batteries. Unless the batteries 
have been run down because of an earlier power failure, the batteries are in 
a "float" state and require very little input power. 

The topology employed by the high-power 3-kVA UPS is online interactive. In this 
technology the UPS monitors the incoming ac waveform and adjusts it on a cycle- 
by-cycle basis, interactively regulating the output ac to the host computer system. 
Should the line deviate substantially outside of its normal range, the UPS transfers 
from online to inverter mode and continues to provide the load with regulated ac 
derived from the UPS's internal batteries. This technology provides excellent regulation 
of the ac output supplied to the load under all line conditions and is suitable 
for mission-critical applications where even slight losses of ac input are disruptive. 
The 3-kVA UPS also provides isolation from line for ground-loop-sensitive products 
by means of an isolation transformer. This topology is also very energy efficient 
because the majority of the losses during normal running are localized in the isolation 
transformer. With proper choice and design these losses can be greatly reduced, 
resulting in a very efficient design. 

HP's offline units are autoranging in both voltage and frequency and have worldwide 
safety and regulatory recognition. This feature allows worldwide coverage 
with just one model per power range. These units have 15 minutes of run time at 
rated load rather than the industry standard of 7 to 8 minutes. The software feature 
set includes programmable on and off times, input voltage, input frequency, output 
voltage, and battery voltage, UPS internal temperature monitoring, self-test mode, 
and numerous other status and warning codes. 

HP's online 3-kVA unit provides regulated 230Vac output at either 50 or 60 Hz. It 
provides 3 kVA or 3 kW of output, allowing full utilization with power factor 
corrected loads. 

Gerald J. Nelson 
James K. Koch 
Development Engineers 
Systems Technology Division 



leadership, and direction throughout the project. Additionally, 
we acknowledge the contribution of numerous engineers, 
program managers, technicians, and support personnel 
from one end of the country to the other for the many 
hours spent in the design and testing of this system. Special 
thanks to Chris Christopher for clearing obstacles and urging 
us to reach for the best product we could design. Confirming 
the quality of their industrial design, the J/K-class systems 
won an award at the 1995 iF (Industrie Forum Design Hannover), 
the world's largest product industrial design forum. 
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A High-Performance, Low-Cost 
Multiprocessor Bus for Workstations 
and Midrange Servers 

The Runway bus, a synchronous, 64-bit, split-transaction, time 
multiplexed address and data bus, is a new processor-memory-I/O 
interconnect optimized for one-way to four-way symmetric 
multiprocessing systems. It is capable of sustained memory bandwidths of 
up to 768 megabytes per second in a four-way system. 

by William R. Bryg, Kenneth K. Chan, and Nicholas S. Fiduccia 



The HP 9000 K-class servers and J-class workstations are the 
first systems to introduce a low-cost, high-performance bus 
structure named the Runway bus. The Runway bus is a new 
processor-memory-I/O interconnect that is ideally suited for 
one-way to four-way symmetric multiprocessing for high-end 
workstations and midrange servers. It is a synchronous, 
64-bit, split-transaction, time multiplexed address and data 
bus. The HP PA 7200 processor and the Runway bus have 
been designed for a bus frequency of 120 MHz in a four-way 
multiprocessor system, enabling sustained memory bandwidths 
of up to 768 Mbytes per second without external 
interface or "glue" logic. 
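
The relationship between the bus parameters and the quoted bandwidth can be checked with a little arithmetic. The 64-bit width, 120-MHz clock, and 768-Mbyte/s sustained figure come from the article; the split of four data cycles for every one header cycle is an assumption of this sketch, chosen because it reproduces the quoted sustained number on a time-multiplexed address and data bus.

```python
BUS_WIDTH_BYTES = 8          # 64-bit bus (from the text)
BUS_CLOCK_HZ = 120_000_000   # 120 MHz (from the text)


def peak_bandwidth_mbytes(width_bytes=BUS_WIDTH_BYTES, clock_hz=BUS_CLOCK_HZ):
    """Bandwidth if every bus cycle carried data, in Mbytes per second."""
    return width_bytes * clock_hz / 1e6


def sustained_bandwidth_mbytes(data_cycles, header_cycles):
    """Bandwidth when some cycles carry transaction headers, not data."""
    useful_fraction = data_cycles / (data_cycles + header_cycles)
    return peak_bandwidth_mbytes() * useful_fraction


peak = peak_bandwidth_mbytes()                 # 960 Mbytes/s
sustained = sustained_bandwidth_mbytes(4, 1)   # assumed 4 data : 1 header
```

With these assumptions the peak is 960 Mbytes/s and the sustained rate is 80% of that, matching the 768 Mbytes/s quoted above.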

The goals for the design of the Runway protocol were to 
provide a price/performance-competitive bus for one-way to 
four-way multiprocessing, to minimize interface complexity, 
and to support the PA 7200 and future processors. The Run- 
way bus achieves these goals by maximizing bus frequency, 
pipelining multiple operations as much as possible, and 
using available bandwidth very efficiently, while keeping 
complexity and pin count low enough so that the bus inter- 
face can be integrated directly on the processors, memory 
controllers, and I/O adapters that connect to the bus. 

Overview 

The Runway bus features multiple outstanding split transactions 
from each bus module, predictive flow control, an efficient 
distributed pipelined arbitration scheme, and a snoopy 
coherency protocol,1 which allows flexible coherency check 
response time. 

The design center application for the Runway protocol is the 
HP 9000 K-class midrange server. Fig. 1 shows a Runway 
bus block diagram of the HP 9000 K-class server. The Run- 
way bus connects one to four PA 7200 processors with a 
dual I/O adapter and a memory controller through a shared 
address and data bus. The dual I/O adapter is logically two 
separate Runway modules packaged on a single chip. Each 
I/O adapter interfaces to the HP BSC I/O bus. The memory 
controller acts as Runway host, taking a central role in arbitration, 
flow control, and coherency through use of a special 
client-OP bus. 



The shared bus portion of the Runway bus includes a 64-bit 
address and data bus, master IDs and transaction IDs to tag 
all transactions uniquely, address valid and data valid signals 
to specify the cycle type, and parity protection for data and 
control. The memory controller specifies what types of 
transactions can be started by driving the special client-OP 
bus, which is used for flow control and memory arbitration. 
Distributed arbitration is implemented with unidirectional 
wires from each module to other modules. Coherency is 
maintained by having all modules report coherency on dedi- 
cated unidirectional wires to the memory controller, which 
calculates the coherency response and sends it with the 
data. 

Each transaction has a single-cycle header of 64 bits, which 
minimally contains the transaction type (TTYPE) and the 
physical address. Each transaction is identified or tagged 
with the issuing module's master ID and a transaction ID, 
the combination of which is unique for the duration of the 
transaction. The master ID and transaction ID are transmitted 
in parallel to the main address and data bus, so no 
extra cycles are necessary for the transmission of the master 
ID and transaction ID. 

The Runway bus is a split-transaction bus. A read transac- 
tion is initiated by transmitting the encoded header, which 
includes the address, along with the issuer's master ID and a 
unique transaction ID, to all other modules. The issuing 
module then relinquishes control of the bus, allowing other 
modules to issue their transactions. When the data is available, 
the module supplying the data, typically memory, arbitrates 
for the bus, then transmits the data along with the 
master ID and transaction ID so that the original issuer 
can match the data with the particular request. 

Write transactions are not split, since the issuer has the data 
that it wants to send. The single-cycle transaction header is 
followed immediately by the data being written, using the 
issuer's master ID and a unique transaction ID. 
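
The tagging scheme for split reads can be sketched in a few lines of Python. This is an illustrative model only, not the Runway hardware logic: the class and method names are hypothetical, and the way free transaction IDs are recycled is an assumption. The limit of 64 transactions in progress per module is the figure given in this article.

```python
MAX_OUTSTANDING = 64  # per-module limit stated in the article


class Requester:
    """Models a bus module that issues split reads and matches returns."""

    def __init__(self, master_id):
        self.master_id = master_id
        self.pending = {}                          # transaction ID -> address
        self._free_ids = list(range(MAX_OUTSTANDING))

    def issue_read(self, address):
        """Send a read header and remember it; the bus is then released."""
        if not self._free_ids:
            raise RuntimeError("no free transaction IDs")
        trans_id = self._free_ids.pop(0)
        self.pending[trans_id] = address
        return (self.master_id, trans_id, address)

    def accept_return(self, master_id, trans_id, data):
        """Match a data return to a pending read using the two tags.

        Returns (address, data) for our own reads, or None when the
        return is tagged for another module or an unknown transaction.
        """
        if master_id != self.master_id or trans_id not in self.pending:
            return None
        address = self.pending.pop(trans_id)
        self._free_ids.append(trans_id)            # ID may now be reused
        return (address, data)
```

Because each return carries both tags, data can come back in a different order from the requests without any extra address cycle to identify it.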

Fig. 2 shows a processor issuing a read transaction followed 
immediately by a write transaction. Each transaction is 
tagged with the issuing module's master ID as well as a 
transaction ID. This combination allows the data response, 
tagged with the same information, to be directed back to the 
issuing module without the need for an additional address 
cycle. Runway protocol allows each module to have up to 64 
transactions in progress at one time. 

Fig. 1. HP 9000 K-class server Runway bus block diagram. 

Arbitration 

To minimize arbitration latency without decreasing maximum 
bus frequency, the Runway bus has a pipelined, two-state 
arbitration scheme in which the determination of the 
arbitration winner is distributed among all modules on the 
bus. Each module drives a unique arbitration request signal 
and receives other modules' arbitration signals. On the first 
arbitration cycle, all interested parties assert their arbitration 
signals, and the memory controller drives the client-OP 
control signals (see Table I) indicating flow control information 
or whether all modules are going to be preempted by a 
memory data return. During the second cycle, all modules 
evaluate the information received and make the unanimous 
decision about who has gained ownership of the bus. On the 
third Runway cycle, the module that won arbitration drives 
the bus. 

Table I 
Client-OP Bus Signals 

ANY_TRANS        Any transaction allowed 
NO_IO            Any transaction allowed except CPU to I/O 
RETURNS_ONLY     Return or response transactions allowed 
ONE_CYCLE        Only one-cycle transactions allowed 
NONE_ALLOWED     No transactions allowed 
MEM_CONTROL      Memory module controls bus 
SHARED_RETURN    Shared data return 
ATOMIC           Atomic owner can issue any transaction; other 
                 modules can only issue response transactions 

With distributed arbitration instead of centralized arbitration, 
arbitration information only needs to flow once between 
bus requesters. Using a centralized arbitration unit 
would require information to flow twice, first between the 
requester and the arbiter and then between the arbiter and 
the winner, adding extra latency to the arbitration. 

Distributed arbitration on the Runway bus allows latency 
between arbitration and bus access to be as short as two 
cycles. Once a module wins arbitration, it may optionally 
assert a special long transaction signal to extend bus ownership 
for a limited number of cycles for certain transactions. 
To maximize bus utilization, arbitration is pipelined: while 
arbitration can be asserted on any cycle, it is effective for 
the selection of the next bus owner two cycles before the 
current bus owner releases the bus. 

Arbitration priority is designed to maintain fairness while 
delivering optimal performance. The highest arbitration 
priority is always given to the current bus owner through 
use of the long transaction signal, so that the current owner 
can finish whatever transaction it started. The second highest 
priority is given to the memory controller for sending out 
data returns, using the client-OP bus to take control of the 
Runway bus. Since the data return is the completion of a 
previous split read request, it is likely that the requester is 
stalled waiting for the data, and the data return will allow 
the requester to continue processing. The third highest arbitration 
priority goes to the I/O adapter, which requests the 
bus relatively infrequently, but needs low latency when it 
does. Lowest arbitration priority goes to the processors, which 
use a round-robin algorithm to take turns using the bus. 
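
The priority order just described can be summarized as a small decision function. This Python sketch is a simplified model under stated assumptions: the function name, its arguments, and the particular round-robin rule (start searching with the processor after the most recent processor winner) are illustrative choices, since the article specifies only the ordering and that processors take turns.

```python
def arbitrate(current_owner, owner_extends, memory_return, io_request,
              cpu_requests, last_cpu, num_cpus=4):
    """Return the name of the next bus owner, or None if the bus stays idle.

    current_owner -- module now driving the bus (or None)
    owner_extends -- True if the owner asserts the long transaction signal
    memory_return -- True if the memory controller has a data return
    io_request    -- True if the I/O adapter is requesting the bus
    cpu_requests  -- set of processor numbers requesting the bus
    last_cpu      -- processor number that most recently won (assumption)
    """
    if owner_extends:            # 1. current owner finishes its transaction
        return current_owner
    if memory_return:            # 2. memory controller data returns
        return "memory"
    if io_request:               # 3. I/O adapter
        return "io"
    # 4. processors, round robin starting after the last winner
    for offset in range(1, num_cpus + 1):
        cpu = (last_cpu + offset) % num_cpus
        if cpu in cpu_requests:
            return f"cpu{cpu}"
    return None
```

For example, with processors 0 and 2 both requesting and processor 0 the last winner, the function selects processor 2; on the next round it would select processor 0 again.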

The arbitration protocol is implemented in such a way that 
higher-priority modules do not have to look at the arbitration 
request signals of lower-priority modules, thus saving 
pins and reducing costs. A side effect is that low-priority 
modules can arbitrate for the bus faster than high-priority 
modules when the bus is idle. This helps processors, which 
are the main consumers of the bus, and doesn't bother the 
memory controller since it can predict when it will need the 
bus for a data return and can start arbitrating sufficiently 
early to account for the longer delay in arbitration. 

Predictive Flow Control 

To make the best use of the available bandwidth and greatly 
reduce complexity, transactions on the Runway bus are 
never aborted or retried. Instead, the client-OP bus is used 
to communicate what transactions can safely be initiated, as 
shown in Table I. 

Since the Runway bus is heavily pipelined, there are queues 
in the processors, memory controllers, and I/O adapters to 
hold transactions until they can be processed. The client-OP 
bus is used to communicate whether there is sufficient room 
in these queues to receive a particular kind of transaction. 
Through various means, the memory controller keeps track 
of how much room is remaining in these queues and restricts 
new transactions when a particular queue is critically full, 
meaning that the queue would overflow if all transactions 
being started in the pipeline plus one more all needed to go 
into that queue. Since the memory controller "predicts" 
when a queue needs to stop accepting new transactions to 
avoid overflow, this is called predictive flow control. 
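
The "critically full" condition can be written down directly. The following Python sketch is a minimal model of the rule stated above; the function names and the way queue occupancy and pipeline depth are passed in are assumptions for illustration, not the memory controller's actual bookkeeping.

```python
def critically_full(capacity, occupied, in_pipeline):
    """True if the queue would overflow should every transaction already
    started in the pipeline, plus one more, need to go into this queue."""
    free_entries = capacity - occupied
    return free_entries < in_pipeline + 1


def may_start_new_transaction(capacity, occupied, in_pipeline):
    """New transactions headed for this queue are allowed only while
    the queue is not critically full."""
    return not critically_full(capacity, occupied, in_pipeline)
```

With an 8-entry queue holding 5 entries and 2 transactions in the pipeline, there is exactly enough room for those 2 plus one more, so new transactions are still allowed; with 6 entries occupied, the queue is critically full and new transactions are held off before any overflow can occur.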

Predictive flow control increases the cost of queue space by 
having some queue entries that are almost never used, but 
the effective cost of queue space is going down with greater 
integration. The primary benefit of predictive flow control is 
greatly reduced complexity, since modules no longer have to 
design in the capability of retrying a transaction that got 
aborted. This also improves bandwidth since each transac- 
tion is issued on the bus exactly once. 



A secondary benefit of predictive flow control is faster com- 
pletion of transactions that must be issued and received in 
order, particularly writes to I/O devices. If a transaction is 
allowed to be aborted, a second serially dependent trans- 
action cannot be issued until the first transaction is guaran- 
teed not to be aborted. Normally, this is after the receiving 
module has had enough time to look at the transaction and 
check the state of its queues for room, which is at least sev- 
eral cycles into the transaction. With predictive flow control, 
the issuing module knows when it wins arbitration that the 
first transaction will issue successfully, and the module can 
immediately start arbitrating for the second transaction. 

Coherency 

The Runway bus provides cache and TLB (translation look- 
aside buffer) coherence with a snoopy protocol. The proto- 
col maintains cache coherency among processors and I/O 
modules with a minimum amount of bus traffic while also 
minimizing the processor complexity required to support 
snoopy multiprocessing, sometimes at the expense of mem- 
ory controller complexity. 

The Runway bus supports processors with four-state
caches: a line may be invalid, shared, private-clean, or
private-dirty. An invalid line is one that is not present in cache.
A line is shared if it is present in two or more caches. A
private line can only be present in one cache; it is private-dirty
if it has been modified, private-clean otherwise.

Whenever a coherent transaction is issued on the bus, each
processor or I/O device (acting as a third party) performs a
snoop, or coherency check, using the virtual index and
physical address. Each module then sends its coherency check
status directly to the memory controller on dedicated COH
signal lines. Coherency status may be COH_OK, which means
that either the line is absent or the line has been invalidated.
A coherency status of COH_SHR means that the line is either
already shared or has changed to shared after the coherency
check. A third possibility is COH_CPY, which means the third
party has a modified copy of the line and will send the line
directly to the requester. Fig. 3 shows a coherent read
transaction that hits a dirty line in a third party's cache.

Fig. 3. A coherent read transaction hits a dirty line in a third
party's cache.

Fig. 4. Cache state transitions resulting from CPU instructions.
The lines with arrows show the transitions of cache state at the
... at other CPUs that caused the transition, as well as the
effect on the other CPU's state. For example, from the invalid
state, a load miss will always cause a Read_Shared_or_Private
transaction. The final state for the load miss will be either
private-clean (if another CPU had the cache line invalid or
private-dirty) or shared (if another CPU had the cache line
shared or private-clean).

After the memory controller has received coherency status
from every module, it will return memory data to the
requester if the coherency status reports consist of only
COH_OK or COH_SHR. If any module signaled COH_SHR, the
memory controller will inform the requester to mark the line
shared on the client-OP lines during the data return. If any
module signals COH_CPY, however, the memory controller
will discard the memory data and wait for the third party to
send the modified cache line directly to the requester in a
C2C_WRITE transaction. The memory controller will also write
the modified data in memory so that the requester can mark
the line clean instead of dirty, freeing the requester (and the
bus) from a subsequent write transaction if the line has to
be cast out. Fig. 4 shows cache state transitions resulting
from CPU instructions. Fig. 5 shows transitions resulting
from bus snoops.
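The way the memory controller combines the per-module coherency responses can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the dictionary it returns are invented for this sketch, and the shared indication in the COH_CPY case is an assumption, since the text does not spell it out.

```python
from enum import Enum


class Coh(Enum):
    OK = "COH_OK"    # line absent or invalidated in the third party
    SHR = "COH_SHR"  # line is, or has become, shared
    CPY = "COH_CPY"  # third party holds a modified copy and will supply it


def resolve_coherency(statuses):
    """Combine the snoop results from every module into one data-return decision."""
    statuses = list(statuses)
    if Coh.CPY in statuses:
        # Memory data is discarded; the owner sends the line in a C2C_WRITE.
        # mark_shared=False here is an assumption, not stated in the text.
        return {"data_from": "third_party_c2c_write", "mark_shared": False}
    # Only COH_OK and COH_SHR were reported: memory supplies the data and
    # tells the requester to mark the line shared if anyone signaled COH_SHR.
    return {"data_from": "memory", "mark_shared": Coh.SHR in statuses}
```

For example, responses of COH_OK, COH_SHR, and COH_OK yield a memory data return with the shared indication set, while any single COH_CPY response switches the data source to the third party.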

The Runway coherency protocol supports multiple
outstanding coherency checks and allows each module to
signal coherency status at its own rate rather than at a fixed
latency. Each module maintains a queue of coherent
transactions received from the bus to be processed in FIFO order
at a time convenient for the module. As long as the
coherency response is signaled before data is available from the
memory controller, delaying the coherency check will not
increase memory latency. This flexibility allows CPUs to
implement simple algorithms to schedule their coherency
checks so as to minimize conflicts with the instruction
pipeline for cache access.
Virtual Cache Support

As on all previous HP multiprocessor buses, virtually indexed
caches are supported by having all coherent transactions
also transmit the virtual address bits that are used to index
the processors' caches. The twelve least-significant address
bits are the offset within a virtual page and are never
changed when translating from a virtual to a physical
address. Ten virtual cache index bits are transmitted; these are
added to the twelve page-offset bits so that virtual caches up
to 4M bytes deep (22 address bits) can be supported.
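The arithmetic behind the 4M-byte limit can be checked with a short sketch. The function names are hypothetical, and the assumption that the ten transmitted index bits sit directly above the twelve page-offset bits is an illustrative reading of the paragraph above, not a statement of the bus signal layout.

```python
PAGE_OFFSET_BITS = 12    # offset within a virtual page, unchanged by translation
VIRTUAL_INDEX_BITS = 10  # virtual cache index bits carried with each coherent transaction


def max_virtual_cache_bytes() -> int:
    """Deepest virtually indexed cache that a snoop can address."""
    return 1 << (PAGE_OFFSET_BITS + VIRTUAL_INDEX_BITS)


def snoop_cache_index(physical_address: int, virtual_index: int) -> int:
    """Form a 22-bit cache index from the page offset and the transmitted index bits."""
    if not 0 <= virtual_index < (1 << VIRTUAL_INDEX_BITS):
        raise ValueError("virtual index must fit in ten bits")
    page_offset = physical_address & ((1 << PAGE_OFFSET_BITS) - 1)
    return (virtual_index << PAGE_OFFSET_BITS) | page_offset
```

Twelve plus ten bits give 2^22 bytes, which is the 4M-byte figure quoted in the text.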

Coherent I/O Support 

Runway I/O adapters take part in cache coherency, which 
allows more efficient DMA transfers. Unlike previous sys- 
tems, no cache flush loop is needed before a DMA output 
and no cache purge is needed before DMA input can begin. 
The Runway bus protocol defines DMA write transactions 
that both update memory with new data lines and cause 
other modules to invalidate data that may still reside in their
caches. 

The Runway bus supports coherent I/O in a system with
virtually indexed caches. I/O adapters have small caches 
and both generate and respond to coherent transactions. 
The I/O adapters have a lookup table (I/O TLB) to attach 
virtual index information to I/O reads and writes, for both 
DMA accesses and control accesses. For more information 
see the article on page 52. 

Coherent I/O also reduces the overhead associated with the
load-and-clear semaphore operation. Since all noninstruction
accesses in the system are coherent, semaphore operations
are performed in the processors' and I/O adapters'
caches. The processor or I/O adapter gains exclusive ownership
and atomically performs the semaphore operation in its
own cache. If the line is already private in the requester's
cache, no bus transaction is generated to perform the
operation, greatly improving performance. The memory controller
is also simplified because it does not need to support
semaphore operations in memory.



Fig. 5. The state transitions resulting from bus coherency
checks (snoops).
Runway Bus Electrical Design Considerations



The Runway bus's high bandwidth is a result of the strategies adopted for its
electrical design. These included an efficient data transfer scheme, a simple clock
system with low clock skew, a compact bus topology, and a termination strategy
that eliminates dead cycles when changing bus masters.

Data Transfer Scheme 

The simple data transfer strategy, shown in Fig. 1, allows most of the cycle to be
used to transfer data. An edge-triggered Runway pad driver is enabled by the
rising edge of the on-chip Runway clock, RCK, causing the data to be driven onto
the external bus. This driven data is then latched one cycle later at the receiving
devices by the next rising edge of the receiver's on-chip Runway clock. On each
Runway VLSI chip, the same physical clock edge is used to trigger the signal driver
and latch the data from the previous cycle in the signal receiver.

The following two equations express timing constraints that must be met for
proper operation:

Setup time equation: $DRIVE_{max} + SKEW + SU \le T_{period}$
Hold time equation: $SKEW + HOLD < DRIVE_{min}$

where DRIVE is the delay from the rising edge of RCK at the driver to the time
when the data is valid at the receiver, SU is the receiver setup time, HOLD is the
receiver hold time, SKEW is the maximum skew of the clock signal (RCK) between
the driver of one chip and the receiver of another, and $T_{period}$ is the clock period.
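As a rough illustration, the setup and hold constraints can be evaluated numerically. The sketch below assumes the setup constraint reads DRIVE(max) + SKEW + SU <= T(period) and the hold constraint reads SKEW + HOLD < DRIVE(min); the function names are hypothetical and any delay values passed in are placeholders, not measured Runway figures.

```python
def setup_ok(drive_max_ns: float, skew_ns: float, setup_ns: float, period_ns: float) -> bool:
    """Setup constraint: the slowest data must arrive a setup time before the next clock edge."""
    return drive_max_ns + skew_ns + setup_ns <= period_ns


def hold_ok(drive_min_ns: float, skew_ns: float, hold_ns: float) -> bool:
    """Hold constraint: the fastest data must not change until the hold time has passed."""
    return skew_ns + hold_ns < drive_min_ns


def max_frequency_mhz(drive_max_ns: float, skew_ns: float, setup_ns: float) -> float:
    """Highest clock frequency that still satisfies the setup constraint."""
    return 1000.0 / (drive_max_ns + skew_ns + setup_ns)
```

With placeholder values of 6.0 ns for the maximum drive delay, 1.0 ns of skew, and 0.5 ns of setup time, the setup constraint is met at a 120-MHz period (about 8.33 ns). Every nanosecond of skew subtracts directly from the time available for driving the bus, which is why the clock design concentrates on reducing it.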

Clock Path 

The Runway clock orchestrates the transfer of information among the components
on the bus. The path of the Runway clock to the driver and receiver circuits can be
divided into three components: on-board clock generation and distribution to the
VLSI chip inputs, on-chip clock reception and buffering, and on-chip clock
distribution to the Runway driver and receiver circuits.

Skew can be introduced by any of these components. Inspection of the setup and
hold time equations reveals that it is desirable to reduce skew to as small a value
as possible.

The clock path begins at the custom VLSI clock generation chip. This chip gener- 
ates several differential pairs of clock outputs, one per Runway VLSI chip. By using 
one chip as the source of the clock signals in this system, the output-to-output 
skew was kept very small. 
Fig. 1. The data transfer timing strategy for the Runway bus allows most of the cycle
to be used to transfer data.



Each dedicated clock pair is carefully routed on the printed circuit board to its
Runway VLSI chip. The traces are adjusted in length so that the arrival time of
each clock at the pins of the Runway VLSI chips can be accurately placed with
respect to the others. Because of known timing differences in the paths of the
clock from input pin to driver and receiver circuits for each type of Runway VLSI
chip, it is useful to be able to tune the clocks in this manner. More will be said
about this later.

Each differential clock signal is received at each chip by a receiver/buffer circuit
that transforms the signal into a single-ended signal RCK with normal CMOS
voltage levels. This RCK signal then fans out to all the Runway signal driver and
receiver circuits located at the pads and the associated interface circuitry located
in the core of the chip. Since the interface circuitry is similar on all three types of
Runway VLSI chips, the capacitive loading on RCK is nearly identical for all three
types, which ensures that the delay through the clock buffer is similar for all
Runway VLSI chips.

The RCK signal is routed using various techniques to reduce the distribution delay 
and thus the variation in delay. The clock receiver/buffer bit slice is centrally 
placed in the interface so that the total distance the RCK signal must travel on- 
chip to the farthest signal bit slice is minimized. This clock is routed in wide metal 
so that the delay along this line is low. The signal pad ordering in the Runway 
interfaces for all of the Runway VLSI chips is nearly identical. This ensures that
the distance from the clock buffer to a signal pad is the same for all of the Runway 
VLSI chips. 

The goal in the design of the Runway clock system was to have the on-chip clock 
RCK arrive at the corresponding signal driver or receiver at the same time at each 
Runway device. Since the CPU is fabricated in CMOS14, a faster technology than
the CMOS26 process used for the I/O adapter and memory controller chips, the
on-board clock signal to the CPU is delayed to account for this known timing dif- 
ference. Thus the clock skew is only a function of the CMOS26 parameters, which 
keeps the skew to a minimum. 

Overall, the total chip-to-chip clock skew on RCK at the signal driver and receiver
circuits is under 1.1 ns worst-case.

Bus Topology 

The components on the bus are designed to be close together to limit the
capacitive and inductive load on each Runway signal line. The setup and hold time
equations can be used to determine how best to lay out the signal path. The
requirements of the two equations sometimes conflict. For example, the setup time
equation wants us to minimize $DRIVE_{max}$ and the hold time equation wants us to
maximize $DRIVE_{min}$. In plain English, we want an interconnect scheme that
minimizes the overall trace length while maintaining the greatest separation between
components.

An ideal connection topology would be a star with the devices placed at the tips 
of the star as shown in Fig. 2a. Because of manufacturing difficulties with this 
topology, the modified star shape shown in Fig. 2b, which fits comfortably using 
standard printed circuit board technology, was chosen. As the figure suggests, the
main trunk of the Runway bus consists of a standard printed circuit trace running
along a backplane with at most four daughter cards attached to the backplane,
two per side. Each daughter card will hold one CPU. The memory controller and
the I/O adapter reside on the backplane along with the clock generation circuitry.
This connection scheme interconnects six Runway devices with less than 9 inches
of total printed circuit trace for the longest signal with no two devices farther
apart than 4.5 inches.



The Runway bus has both full-line (32-byte) and half-line
(16-byte) DMA input transactions, called WRITE_PURGE and
WRITE16_PURGE. Both transactions write the specified amount
of data into memory at the specified address, then invalidate
any copies of the entire line that may be in a processor's
cache. The full-line WRITE_PURGE is the accepted method for
DMA input on systems that have coherent I/O, if the full line
is being written. The half-line WRITE16_PURGE is used for
16-byte writes if enabled via the fast DMA attribute in the
I/O TLB. Software programs the I/O TLB with the fast attribute
if it knows that both halves of the line will be overwritten
by the DMA. Otherwise, if the I/O TLB does not specify
Termination Strategy

A parallel termination strategy using external resistors is usually used for
high-speed buses so that incident wave switching can be employed, that is, the re
... needs to drive in one direction. The termination resistor drives the bus the other
way when the driver tristates or turns off. While the driver is on, direct current
flows constantly. When the driver turns off, the bus is disturbed by the change of
current through the inductive traces and bond wires. This disturbance sends a
wave propagating down the transmission line in the direction opposite to the
direction of propagation when the driver turns on. A special frequency-limiting
case is the master changeover, when a driver at the end of the bus starts to drive
the same value that was driven by the master at the other end of the bus in the
previous cycle. In this case, constructive interference of the two propagating
waves may cause the bus to take a long time to settle. It is not uncommon to
insert a dead cycle in the protocol to allow extra time for the bus to settle when
the bus changes masters.

On a series-terminated bus, the bus driver has the ability to drive in both
directions. The on-impedance of the driver transistor acts as the termination resistor.
The driver transistor will turn on and drive the bus to the desired level. Near the
end of the cycle when the bus is nearing its final value, the drivers will be
sourcing or sinking only a small fraction of their peak currents at the start of the cycle.
Fig. 2. (a) Star topology. (b) Modified star topology of the Runway bus.



Because of this, when the driver is disabled at the end of the cycle, there is very
little disturbance in the line. This makes it possible to have another driver master
the bus in the very next cycle. However, because the on-impedance of the driver is
not well controlled, the receiver must usually wait to receive the reflected wave,
which increases the bus propagation delay.

Runway bus topology is very compact. This means that the time difference
between the arrivals of the incident and reflected waves is relatively small compared
to the bus cycle time. Had we employed parallel termination on the Runway bus,
we might have been able to increase the frequency of the bus by about 20%.
Since the dead cycle would have cost us about 20 to 30 percent in bandwidth, we
decided to use series termination instead.

Other advantages of series-terminated buses include good tolerance to impedance
mismatches and long stubs and no dc power dissipation. Also, we saved valuable
board space by not having to place resistors on each Runway bus signal.

Simulated and Characterized Performance 

Early in the Runway bus simulations, it became clear that the result would be
dependent not only on which driver drove the bus, but also on the current state of
the bus. The state of the bus is precisely determined by the history of drivers
driving the bus along with the starting condition of the bus. Since the current state
of the bus is mostly determined by who was last driving it, all possible pairs of
successive bus transactions by two drivers were simulated. The symmetry of the
bus and our ability to predict and eliminate combinations that would not be
worst-case helped cut down the number of slow-case simulations to 32. A network of
fast HP 9000 Model 720 and 750 workstations was able to run these simulations,
each of which normally takes one machine about one hour to run, in about
4 hours.

The SPICE model to simulate the worst case was in constant revision as more and
more details from the design were implemented. The final model had
artwork-extracted transistor models for the signal driver and receiver for each Runway
VLSI chip and a detailed schematic model for each package trace and board
connector. The printed circuit traces were modeled using SPICE transmission-line
primitives.

The final simulated worst-case bus frequency came in at 152 MHz using a fully
loaded Runway bus. The characterized frequency of the bus over the extremes of
process, temperature, and voltage showed operation of at least 140 MHz. The
maximum characterization frequency was limited to 140 MHz because of the
limitations of other system components. These results gave us the confidence to
conclude that the Runway bus will work at the 120 MHz frequency goal with
sufficient manufacturing margin.
the fast attribute, the I/O adapter uses the slower read private,
merge, write back transaction, which will safely merge
the DMA data with any dirty processor data. The use of
WRITE16_PURGE greatly increases DMA input bandwidth for
older, legacy I/O cards that use 16-byte blocks.
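The choice among the DMA input transactions can be summarized in a short sketch. The function name and the READ_PRIVATE_MERGE_WRITE_BACK label are hypothetical; the text names only WRITE_PURGE and WRITE16_PURGE and describes the slower path in words.

```python
LINE_BYTES = 32       # full cache line
HALF_LINE_BYTES = 16  # half cache line


def dma_input_transaction(nbytes: int, fast_dma_attribute: bool) -> str:
    """Pick the DMA input transaction an I/O adapter would use for one write."""
    if nbytes == LINE_BYTES:
        # The whole line is overwritten, so stale cached copies can simply be purged.
        return "WRITE_PURGE"
    if nbytes == HALF_LINE_BYTES and fast_dma_attribute:
        # Software has promised that both halves of the line will be overwritten.
        return "WRITE16_PURGE"
    # Otherwise the new data must be merged with any dirty processor data.
    return "READ_PRIVATE_MERGE_WRITE_BACK"
```

The fast attribute matters because a half-line purge discards the other half of any dirty cached line, which is only safe when the DMA will overwrite that half as well.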



Design Tradeoffs 

To get the best performance from a low-cost interconnect,
the bus designers chose a time-multiplexed bus, so that the
same pins and wires can be used for both address and data.
Separate address and data buses would have increased the
number of pins needed by about 50%, but would have
increased usable bandwidth by only 20%. Since the number of
pins on a chip has a strong impact on chip cost, the
time-multiplexed address and data bus gives us the best trade-off.
A smaller number of pins is also important to allow the bus
interface to be included on the processor chip instead of
requiring one or more separate transceiver chips.

To get the best bandwidth, the designers targeted for the 
highest bus frequency that could be achieved without requir- 
ing dead cycles for bus turnaround. The use of dead cycles 
would have allowed a higher nominal frequency, but dead 
cycles would have consumed 20 to 30 percent of the band- 
width, for a net loss. 
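The net-loss argument can be checked with simple arithmetic. The sketch below uses the roughly 20 percent frequency gain estimated in the electrical design sidebar together with the 20 to 30 percent dead-cycle cost quoted above; the function name is hypothetical and the model ignores every other effect.

```python
def relative_bandwidth(frequency_gain: float, dead_cycle_loss: float) -> float:
    """Bandwidth relative to the dead-cycle-free design (1.0 means no change)."""
    return (1.0 + frequency_gain) * (1.0 - dead_cycle_loss)


best_case = relative_bandwidth(0.20, 0.20)   # about 0.96 of the chosen design
worst_case = relative_bandwidth(0.20, 0.30)  # about 0.84 of the chosen design
```

Both ends of the range fall below 1.0, which is consistent with the designers' conclusion that dead cycles would have been a net loss.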

Bandwidth Efficiency 

The Runway bus has a rated raw bandwidth of 960 Mbytes/second,
which is derived by taking the width of the bus and
multiplying by the frequency: 64 bits × 120 MHz ÷ 8 bits/byte
= 960 megabytes/second. However, raw bandwidth is
an almost meaningless measure of the bus, since different 
buses use the raw bandwidth with greatly differing amounts 
of efficiency. Instead, buses should be compared on effec- 
tive or usable bandwidth, which is the amount of data that is 
transferred over time using normal transactions. 

To deliver as much of the raw bandwidth as possible as us- 
able bandwidth, the designers minimized the percentage of 
the cycles that were not delivering useful data. Transaction 
headers, used to initiate a transaction, are designed to fit 
within a single cycle. Data returns are tagged with a sepa- 
rate transaction ID field, so that data returns do not need a 
return header. Finally, electrically, the bus is designed so 
that dead cycles are not necessary for master changeover. 
The only inherent overhead is the one-cycle transaction 
header. 

For 32-byte lines, both read and write transactions take ex- 
actly five cycles on the bus: one cycle for the header and 
four cycles for data. It doesn't matter that the read transac- 
tion is split. Thus, for the vast majority of the transactions 
issued on the bus, 80% of the cycles are used to transmit 
data. The effective bandwidth is 960 Mbytes/s × 80% =
768 Mbytes/s, which is very efficient compared to competitive
buses.
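The bandwidth figures above follow directly from the bus width, the clock frequency, and the one-cycle header. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def raw_bandwidth_mbytes(width_bits: int, freq_mhz: float) -> float:
    """Raw bandwidth in Mbytes/s: one bus-width transfer per clock cycle."""
    return width_bits * freq_mhz / 8


def data_cycles(line_bytes: int, width_bits: int) -> int:
    """Bus cycles needed to move one cache line."""
    return (line_bytes * 8) // width_bits


def effective_bandwidth_mbytes(width_bits: int, freq_mhz: float,
                               line_bytes: int, header_cycles: int = 1) -> float:
    """Usable bandwidth once the transaction header cycles are accounted for."""
    cycles = data_cycles(line_bytes, width_bits)
    efficiency = cycles / (cycles + header_cycles)
    return raw_bandwidth_mbytes(width_bits, freq_mhz) * efficiency
```

For a 64-bit bus at 120 MHz moving 32-byte lines, this gives four data cycles per five-cycle transaction, or 80% of the 960-Mbyte/s raw figure.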

In addition, the Runway bus is able to deliver its bandwidth 
to the processors that need the bandwidth. Traditional buses 
typically allow each processor to have only a single out- 
standing transaction at a time, so that each processor can 
only get at most about a quarter of the available bandwidth. 
Runway protocol allows each module (processor or I/O
adapter) to have up to 64 transactions outstanding at a
time. The PA 7200 processor uses this feature to have multiple
outstanding instruction and data prefetches, so that it
has fewer stalls as a result of cache misses. When a processor
really needs the bandwidth, it can actually get the vast
majority of the available 768-Mbyte/s bandwidth.

High Frequency 

The Runway protocol is designed to allow the highest pos- 
sible bus frequency for a given implementation. The proto- 
col is designed so that no logic has to be performed in the 



same cycle that data is transmitted from one chip to another 
chip. Any logic put into the transmission cycle would add to 
the propagation delay and reduce the maximum frequency 
of the bus. From a protocol standpoint, for this to work, 
each chip will receive bus signals at the end of one cycle, 
evaluate those signals in a second cycle and decide what to 
transmit, then transmit the response in the third cycle. 

To maximize implementation frequency, the Runway bus
project took a system-level design approach. All modules on 
the bus and the bus itself were designed together for opti- 
mal performance instead of designing an interface specifi- 
cation permitting any new module to be plugged in as long 
as it conforms to the specification. We achieved a higher- 
performance system with the system approach than we 
could have achieved with an interface specification. 

The I/O driver cells for the different modules were designed
together and SPICE-simulated iteratively to get the best per- 
formance. Since short distances are important, the pinouts 
of the modules are coordinated to minimize unnecessary 
crossings and to minimize the worst-case bus paths. See 
page 22 for more information on the electrical design of the 
Runway bus. 

The bus can be faster if there are fewer modules on it, since 
there is less total length of bus and less capacitance to drive. 
The maximum configuration is limited to six modules — four 
processors, a dual I/O adapter, and a memory controller — to 
achieve the targeted frequency of 120 MHz. 

Another optimization made to achieve high bus frequency is 
the elimination of wire-ORs. By requiring that only one mod- 
ule drive a signal in any cycle, some traditionally bused sig- 
nals that require fast response, such as cache coherency 
check status, are duplicated, one set per module. Other 
bused signals that do not require immediate response (e.g.,
error signals) are more cost-effectively transformed into
broadcast transactions. Adapting the Runway protocol to 
eliminate wire-ORs allowed us to boost the bus frequency by 
10 to 20 percent. 

Acknowledgments 

The authors would like to acknowledge the contributions of 
many individuals who participated in the definition of the 
Runway bus protocol: Robert Brooks, Steve Chalmers,
Barry Flahive, David Fotland, Craig Frink, Hani Hassoun,
Tom Hotchkiss, Larry McMahan, Bob Naas, Helen Nusbaum,
Bob Odineal, John Shelton, Tom Spencer, Brendan Voge,
John Wickeraad, Jim Williams, John Wood, and Mike Ziegler.
In addition, we would like to thank the various engineers
and managers of the Computer Technology Laboratory, the
General Systems Laboratory, the Chelmsford Systems
Laboratory, and the Engineering Systems Laboratory who helped
design, verify, build, and test the processors, I/O adapters,
and memory controllers necessary to make the Runway bus
a reality.

Reference 

1. P. Stenstrom, "A Survey of Cache Coherence Schemes for
Multiprocessors," IEEE Computer, Vol. 23, no. 6, June 1990, pp. 12-25.



24 February 1996 Hewlett-Packard Journal

©Copr. 1949-1998 Hewlett-Packard Co. 



Design of the HP PA 7200 CPU 



The PA 7200 processor chip is specifically designed to give enhanced 
performance in a four-way multiprocessor system without additional 
interface circuits. It has a new data cache organization, a prefetching 
mechanism, and two integer ALUs for general integer superscalar 
execution. 

by Kenneth K. Chan, Cyrus C. Hay, John R. Keller, Gordon P. Kurpanek, Francis X. Schumacher, and
Jason Zheng 



Since 1986, Hewlett-Packard has designed PA-RISC processors
for its technical workstations and servers, commercial
servers, and large multiprocessor transaction processing
machines. The PA 7200 processor chip is an evolution of
the high-performance single-chip superscalar PA 7100 design.

The PA 7200 incorporates a number of enhancements specifically
designed for a glueless four-way multiprocessor system
with increased performance on both technical and commercial
applications. On the chip is a multiprocessor system
bus interface which connects directly to the Runway bus
described in the article on page 18. The PA 7200 also has a
new data cache organization, a prefetching mechanism, and
two integer ALUs for general integer superscalar execution.
The PA 7200 artwork was scaled down from the PA 7100's
0.8-micrometer HP CMOS26B process for fabrication in a
0.55-micrometer HP CMOS14A process.

Fig. 1 shows the PA 7200 in a typical symmetric multiprocessor
system configuration and Fig. 2 is a block diagram of the
PA 7200. 

Processor Overview 

The PA 7200 VLSI chip contains all of the circuits for one 
processor in a multiprocessor system except for external 
cache arrays. This includes integer and floating-point execu- 
tion units, a 120-entry fully associative translation lookaside 
buffer (TLB) with 16-block translation entries and hardware 
TLB miss support, off-chip instruction and data cache interfaces
for up to 2M bytes of off-chip cache, an assist cache,
and a system bus interface. The floating-point unit in the 



PA 7200 is the same as that in the PA 7100 and retains the 
PA 7100's 2-cycle latency and fully pipelined execution of
single and double-precision add, subtract, multiply, FMPYADD,
and FMPYSUB instructions. The instruction cache interface 
and integer unit are enhanced for superscalar execution of 
integer instruction pairs. The bus interface and the assist 
cache are completely new designs for the PA 7200. 

In addition to the performance features, the PA 7200 con- 
tains several new architectural features for specialized 
applications: 

• Little endian data format support on a per-process basis 

• Support for uncacheable memory pages 

• Increased memory page protection ID (PID) size 

• Load/store "spatial locality only" cache hint 

• Coherent I/O support. 

The CPU is fabricated in Hewlett-Packard's CMOS14A pro- 
cess with 0.55-micrometer devices and three-level metal 
interconnect technology. The processor chip is 1.4 by 1.5 cm 
in size, contains 1.3 million transistors, and is packaged in a
540-pin ceramic PGA. IEEE 1149.1 JTAG-compliant boundary
scan protocol is included for chip test and fault isolation.
Fig. 3 is a photomicrograph of the PA 7200 CPU chip.

Instruction Execution 

A key feature of the PA 7100 that is retained in the PA 7200 
is an execution pipeline highly balanced for both high-fre- 
quency operation and very few (compared to most current 
microprocessors) pipeline stall cycles resulting from data,
control, and fetch dependencies. The only common pipeline
stall penalties are a one-cycle load-use interlock for any
cache hit, a one-cycle penalty for the immediate use of a 
floating-point result, a zero-to-one-cycle penalty for a mis- 
predicted branch, and a one-cycle penalty for store-load 
combinations. The PA 7200 improves on the PA 7100 pipe- 
line by removing the penalty for store-store combinations. 
This was achieved by careful timing of off-chip SRAMs, 
which are cycled at the full processor frequency. Removal of 
the store-store penalty is particularly helpful for code that 
has bursts of register stores, such as the code typically 
found at procedure calls and state saves. 




The PA 7200 features an integer superscalar implementation 
geared to high-frequency operation similar to the PA 7100LC
processor. In a superscalar processor, more than one
instruction can be executed in a single clock cycle. When two
instructions are executed each cycle, this is also referred to
as bundling or dual-issuing. In previous PA 7100 processors,
only a floating-point operation could be paired with an integer
operation. The PA 7200 adds the ability to execute two
integer operations per cycle. This will benefit many applications
that do not have intensive floating-point operations. To
support this integer superscalar capability, the PA 7200 adds
a second integer ALU, two extra read ports and one extra
write port in the general register stack, a new predecoding
block, a new instruction bus, additional register bypassing
circuits, and associated control logic.

Instructions are classified into three groups: integer operations,
loads and stores, and floating-point operations. The
PA 7200 can execute a pair of instructions in a single cycle if
they are from different groups or if they are both from the
integer operation group. Branches are a special case of integer
operations: they can execute with the preceding instruction
but not with the succeeding instruction. Double-word
alignment is not required for instructions executing in the 
same cycle. As in the PA 7100, only floating-point operations 
can bundle across a cache line or page boundaries. The 
PA 7200 can also execute two instructions writing to the 
same target register in a single cycle. 
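The basic pairing rule can be expressed as a small predicate. This sketch covers only the group rule and the branch restriction stated above; the names are hypothetical, and it deliberately ignores data dependencies, cache line and page boundaries, and the other restrictions the chip enforces.

```python
from enum import Enum


class Group(Enum):
    INTEGER = "integer operation"
    LOAD_STORE = "load or store"
    FLOAT = "floating-point operation"


def can_bundle(first: Group, second: Group, first_is_branch: bool = False) -> bool:
    """Return True if two instructions in program order may issue in the same cycle."""
    if first_is_branch:
        # A branch may pair with the instruction before it, never the one after.
        return False
    # Different groups may pair; within one group only two integer operations may.
    return first is not second or first is Group.INTEGER
```

Under this rule two integer operations bundle, an integer operation bundles with a load or a floating-point operation, but two loads or two floating-point operations do not.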

The PA 7200 contains three instruction buses that connect 
the instruction cache interface to two integer ALUs and a 
floating-point unit. As in the PA 7100, an on-chip double- 
word instruction buffer assists the bundling of two instruc- 
tions that may not be double-word aligned. On every cycle, 
one or two instructions can come from any of four sources 
(even or odd instructions from the cache, or even or odd 
instructions from the on-chip buffer) and can go to any of 
the three destination buses. 



Fig. 3. PA 7200 CPU chip.
The process by which multiple instructions are dispatched
to different instruction buses leading to corresponding
execution units is called steering. The PA 7200 has a very
aggressive timing budget for steering and instruction decoding
(done in less than one processor cycle); therefore, the
steering logic must be fast. In addition, on every cycle, the
control logic needs to track which one or two of the three
instruction buses contain valid instructions as well as the
order of concurrently issued instructions. To avoid having
superscalar steering and execution decode logic degrade the
CPU frequency, six predecode bits are allocated in the
instruction cache for each double word. Data dependencies
and resource conflicts are checked and encoded in predecode
bits as instructions are moved from memory into the
cache, when timing is more relaxed. These six predecode
bits are carefully designed so that they are optimal for both
the steering circuits and the control logic for proper
pipelined execution. Thanks to the optimized design and
implementation of these predecode bits and the associated
steering circuits and control logic, this path is not a speed-limiting
path for the PA 7200 chip and does not obstruct its
high-frequency operation.

To minimize area, shift-merge and test condition units are
not duplicated in the second ALU. Thus shifts, extracts,
deposits, and instructions using the test condition block are
limited to one per cycle. Also, instructions with test condi-
tions cannot be bundled with integer operations or loads or
stores as their successors. A modern compiler can minimize
the effect of these few superscalar restrictions through code
scheduling, thereby allowing the processor to exploit much
of the instruction-level parallelism available in application
code to achieve a low average CPI (cycles per instruction).
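
The bundling restrictions described above can be collected into a small predicate. The following is only a sketch under stated assumptions: the group names ("int", "mem", "fp") and the Instr fields are hypothetical labels, data dependencies and resource conflicts are not modeled, and "limited to one per cycle" is read conservatively as one shift-merge or test-condition user per bundle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    group: str                        # "int", "mem", or "fp" (assumed group names)
    is_branch: bool = False           # branches are a special case of "int"
    uses_shift_merge: bool = False    # shifts, extracts, deposits
    uses_test_condition: bool = False


def can_bundle(first: Instr, second: Instr) -> bool:
    """Return True if `first` and `second` may issue in the same cycle."""
    # A branch may pair with the preceding instruction, not the succeeding one.
    if first.is_branch:
        return False
    # Different groups bundle; within one group only two integer ops do.
    if first.group == second.group and first.group != "int":
        return False
    # Shift-merge and test condition units exist in only one ALU.
    first_special = first.uses_shift_merge or first.uses_test_condition
    second_special = second.uses_shift_merge or second.uses_test_condition
    if first_special and second_special:
        return False
    # A test-condition instruction cannot be followed by an integer op,
    # load, or store in the same bundle.
    if first.uses_test_condition and second.group in ("int", "mem"):
        return False
    return True
```

A compiler scheduler could use a check of this kind to decide whether reordering two adjacent instructions would let them issue together.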

Data Cache Organization 

Fig. 4 shows the PA 7200's data cache organization. The chip
contains an interface to up to 1M bytes of off-chip direct
mapped data cache consisting of industry-standard SRAMs.
The off-chip cache is cycled at the full processor frequency
and has a one-cycle latency.

The chip also includes a small fully associative on-chip assist 
cache. Two pipeline stages are associated with address gen- 
eration, translation, and cache access for both caches, which 
results in a maximum of a one-cycle load-use penalty for a 
hit in either cache. The on-chip assist cache and
the off-chip cache together form a level-1 cache. Because
this level-1 cache is accessed in one processor cycle and
supports a large cache size, no level-2 cache is supported. 
The ability to access the large off-chip cache with low latency 
greatly reduces the CPI component associated with cache- 
resident memory references. This is particularly helpful for 
code with large working data sets. 

The on-chip assist cache consists of 64 fully associative
32-byte cache lines. A content-addressable memory (CAM)
is used to match a translated real line address with each
entry's tag. For each cache access, 65 entries are checked
for a valid match: 64 assist cache entries and one off-chip
cache entry. If either cache hits, the data is returned directly
to the appropriate functional unit with the same latency.
Aggressive self-timed logic is employed to achieve the 
timing requirements of the assist cache lookup. 

Lines requested from memory as a result of either cache 
misses or prefetches are initially moved to the assist cache. 
Lines are moved out of the assist cache in first-in, first-out
order. Moving lines into the assist cache before moving
them into the off-chip cache eliminates the thrashing behav-
ior typically associated with direct mapped caches. For ex-
ample, in the vector calculation:

for i := 0 to N do
  A[i] := B[i] + C[i] + D[i]

if elements A[i], B[i], C[i], and D[i] map to the same cache index,
then a direct mapped cache alone would thrash on each
element of the calculation. This would result in 32 cache
misses for eight iterations of this loop. With an assist cache, 
however, each line is moved into the cache system without 
displacing the others. Assuming sequential 32-bit data ele-
ments, eight iterations of the loop cause only the initial
four cache misses.
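
The miss counts quoted above can be reproduced with a small model. This is a minimal sketch, not the hardware algorithm: it assumes a 1M-byte direct mapped cache, four arrays placed exactly 1M byte apart so that they share a cache index, write-allocate stores, and an access order of B, C, D, then A within each iteration. The function and variable names are illustrative.

```python
from collections import OrderedDict

LINE_BYTES = 32        # line size stated in the text
ASSIST_ENTRIES = 64    # assist cache entries stated in the text


def count_misses(addresses, main_lines, use_assist):
    """Count misses for a direct mapped cache, optionally fronted by a
    FIFO fully associative assist cache that receives all miss lines."""
    main = {}                 # cache index -> resident line number
    assist = OrderedDict()    # FIFO of line numbers
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index = line % main_lines
        if main.get(index) == line or line in assist:
            continue          # hit in either cache
        misses += 1
        if not use_assist:
            main[index] = line            # displaces the previous occupant
            continue
        if len(assist) == ASSIST_ENTRIES:
            victim, _ = assist.popitem(last=False)   # oldest line leaves first
            main[victim % main_lines] = victim
        assist[line] = True
    return misses


MAIN_BYTES = 1 << 20
MAIN_LINES = MAIN_BYTES // LINE_BYTES
bases = {"A": 0, "B": MAIN_BYTES, "C": 2 * MAIN_BYTES, "D": 3 * MAIN_BYTES}
# Eight iterations over 32-bit elements: read B, C, D, then write A.
trace = [bases[name] + 4 * i for i in range(8) for name in ("B", "C", "D", "A")]

direct_only = count_misses(trace, MAIN_LINES, use_assist=False)
with_assist = count_misses(trace, MAIN_LINES, use_assist=True)
```

Under these assumptions direct_only is 32 and with_assist is 4, matching the figures in the text.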

Larger caches do not reduce this type of cache thrashing. 
While modern compilers are often able to realign data struc-
tures to reduce or eliminate thrashing, sufficient compile
time information is not always available in an application to 
make the correct optimization possible. The PA 7200's assist 
cache eliminates cache thrashing extremely well with mini- 
mal hardware and without compiler optimizations. 

Lines that are moved out of the assist cache can condition-
ally bypass the off-chip cache and move directly back to
memory. A newly defined spatial locality only hint can be
specified in load and store instructions to indicate that data
exhibits spatial locality but not temporal locality. A data line
fetched from memory for an instruction containing the spa-
tial locality hint is moved into the assist cache like all other
lines. Upon replacement, however, the line is flushed back
to memory instead of being moved to the off-chip cache.
This mechanism allows large amounts of data to be pro-
cessed without polluting the off-chip cache. Additionally,
cycles are saved by avoiding one or two movements of the
cache line across the 64-bit interface to the off-chip cache.
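
The replacement behavior for hinted lines can be sketched as follows. This is an illustrative model only: the function name, the tuple layout, and the small assist_entries value used in the usage note are assumptions, and dirty-line handling is not modeled.

```python
from collections import deque


def route_victims(fills, assist_entries=64):
    """Model FIFO replacement in the assist cache.

    fills is a sequence of (line, spatial_only) pairs in arrival order.
    Returns (to_offchip, to_memory): the victims moved to the off-chip
    cache and the victims that bypass it because of the hint."""
    fifo = deque()
    to_offchip, to_memory = [], []
    for line, spatial_only in fills:
        if len(fifo) == assist_entries:
            victim, hinted = fifo.popleft()
            (to_memory if hinted else to_offchip).append(victim)
        fifo.append((line, spatial_only))
    return to_offchip, to_memory
```

For example, with a two-entry assist cache, route_victims([(1, False), (2, True), (3, False), (4, False)], assist_entries=2) sends line 1 to the off-chip cache and line 2 back to memory.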

The assist cache allows prefetches to be moved into the
cache system in a single cycle. Prefetch returns are accumu- 
lated independently of pipeline execution. When the com- 
plete line is available, one data cache cycle is used to insert 
the line into the on-chip assist cache. If an instruction that is 
not using the cache is executing, no pipeline stalls are 
incurred. 

Because the assist cache is accessed using a translated 
physical address, it adds an inherently critical speed path to 
the chip microarchitecture. An assist cache access consists 
of virtual cache address generation, translation lookaside 
buffer (TLB) lookup to translate the virtual address into a 
physical address, and finally the assist cache lookup. The 
TLB lookup and assist cache lookup need to be completed 
in one processor cycle, or 8.3 ns for 120-MHz operation. To
meet the speed requirements of this path, a combination of
dynamic and self-timed circuit techniques is used. 



The TLB and assist cache are composed of content-address-
able memory (CAM) structures, which differ from more typi-
cal random-access memory (RAM) structures in that they
are accessed with data, which is matched with data stored
in the memory, rather than by an index or address. A typical
RAM structure can be broken into two halves: an address
decoder and a memory array. The input address is decoded
to determine which memory element to access. Similarly, a
CAM has two parts: a match portion and a memory array. In
the case of the assist cache, the match portion consists of
27-bit comparators that compare the stored cache line tag
with the translated physical address of the load or store in-
struction. When a match is detected by one of the compara-
tors, then that comparator dumps the associated cache line
data.
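
The difference between the two access styles can be illustrated in software. This is a conceptual sketch only: in hardware all comparators evaluate in parallel, whereas the loop below checks entries one at a time, and the (valid, tag, data) tuple layout is an assumption made for the example.

```python
def ram_read(memory, address):
    """RAM: the address is decoded to select exactly one array entry."""
    return memory[address]


def cam_read(entries, key):
    """CAM: the key is compared with every stored tag.

    entries is a list of (valid, tag, data) tuples. Returns the data of
    the matching entry, or None if no comparator matches (a miss)."""
    for valid, tag, data in entries:
        if valid and tag == key:
            return data
    return None
```

A lookup such as cam_read([(True, 0x12345, "line A"), (False, 0x777, "stale")], 0x12345) returns "line A", while a key that matches only an invalid entry returns None.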

Fig. 5 shows the timing of an access to the TLB and assist
cache. This single 8.3-ns clock cycle path is broken into mul-
tiple subsections using self-timed circuits. An access begins
when the single-ended virtual address is latched and con-
verted to complementary predischarged values, VADDR and
its complement, in the TLB address buffer on the rising edge of CK.
These dual-rail signals are then used to access the CAM
array. A dummy CAM array access, representing the worst-
case timing through the CAM array, is used to initiate the
TLB RAM access. If any of the CAM entries matches the
VADDR, then the completion of the dummy CAM access,
represented by TLB READ_CK, enables the TLB read control
circuits to drive one of the TLB RAM read lines. The pre-
charged RAM array is then read and a differential predis-
charged physical address is driven to the assist cache.
A similar access is then made to the assist cache CAM and
RAM structures to produce data on the rising edge of CK.
A precharged load aligner is used to select the appropriate
part of the 256-bit cache line to drive onto the data bus and
to perform byte swapping for big-to-little-endian data format
conversion. Although this path contains tight timing budgets,
careful circuit design and physical layout ensure that it does
not limit the processor frequency.

The basic structure of the external cache remains unchanged
from the PA 7100 CPU. Separate instruction (I) and data (D)
caches are employed, each connected to the CPU by a 64-bit
bidirectional bus. The cache is virtually indexed and physi-
cally tagged to minimize access latency. The I-cache data
and tag are addressed over a common address bus, IADH.
The D-cache data has a separate address bus, DADH, and the
D-cache tag has a separate address bus, TADH. Used in con-
junction with an internal store buffer for write data, the split
D-cache address allows higher-bandwidth stores to the D-
cache. Instead of a serial read-modify-write, stores can be
pipelined so that TADH can be employed for the tag read of a
new store instruction while DADH is used to write the data
from the previous store instruction.

As in the PA 7100 CPU, the PA 7200 CPU cache interface is
tuned to work with asynchronous SRAMs by creating special
clock signals for optimal read and write timing. The cache is
read with a special latch edge that allows wave pipelining,
that is, a second read is launched before the first read is
actually completed. The cache is written using two special
clocks that manipulate the write enable and output enable
SRAM controls for a minimum total write cycle time.
The design team worked closely with several key SRAM
vendors to develop a specification for a 6-ns SRAM with
enhanced write speed capabilities. These new SRAMs allow
both of the caches to operate at the CPU clock frequency.
The CPU can be shipped with equal-sized instruction and
data caches of up to 1M bytes each. As in the PA 7100 CPU,
a read can be finished in one clock cycle. However, to match
the bandwidth of the Runway bus and to increase the perfor-
mance of store-intensive applications, a significant timing
change was made to improve the bandwidth for writes to
the cache. The PA 7200 CPU achieves a quasi-single-cycle
write: a series of N writes requires N+1 cycles. The one-
cycle overhead is required for turning the bus around from
read to write, that is, one cycle is required to turn off the
SRAM drivers and allow the CPU drivers to take over. No
penalty is incurred in transitioning from write to read.
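
The N+1 rule can be expressed as a short cycle count. This is a sketch under two assumptions not stated in the text: the bus is taken to start in the read direction, and every individual read or write occupies exactly one cycle.

```python
def cache_bus_cycles(ops, initial_direction="R"):
    """Count cache bus cycles for a string of 'R' (read) and 'W' (write)
    operations, charging one turnaround cycle on each read-to-write switch."""
    cycles = 0
    previous = initial_direction
    for op in ops:
        if previous == "R" and op == "W":
            cycles += 1      # SRAM drivers turn off, CPU drivers take over
        cycles += 1          # the access itself
        previous = op
    return cycles
```

For instance, cache_bus_cycles("WWWW") gives 5 (N+1 for N = 4), while cache_bus_cycles("WWRR") also gives 5 because the write-to-read switch is free.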

Prefetching Mechanisms 

A significant amount of execution time is spent waiting for
data or instructions to be returned from memory. In an
HP 9000 K-class system running transaction processing ap-
plications, an average of about one cycle per instruction can
be attributed to the processor waiting for memory. The total
CPI for such an application is about 2. Execution time can
therefore be greatly reduced by reducing the number of
cycles the processor spends waiting for memory. The
PA 7200 incorporates hardware and software prefetching
mechanisms, which initiate memory requests before the
data or instructions are used.
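
The leverage of memory waiting time on overall performance can be estimated with simple arithmetic. The sketch below assumes a fixed instruction count and clock rate, so that speedup equals the ratio of CPI values; the fraction_removed parameter is a hypothetical knob, not a measured quantity.

```python
def speedup_from_memory_savings(total_cpi, memory_cpi, fraction_removed):
    """Speedup obtained when a fraction of the memory-wait cycles
    per instruction is eliminated (same instruction count and clock)."""
    new_cpi = total_cpi - memory_cpi * fraction_removed
    return total_cpi / new_cpi
```

With the approximate figures from the text (total CPI of 2, of which 1 is memory waiting), removing half of the memory waiting would give a speedup of about 1.33, and removing all of it would bound the speedup at 2.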

Instruction Prefetching. The PA 7200 implements an efficient
instruction prefetch algorithm. Instruction fetch requests
are issued speculatively ahead of the instruction execution 
stream. Multiple instruction prefetch requests can be in 
flight to the memory system simultaneously. Issuing multiple 
prefetches ahead of the execution stream works well when 
linear code segments are initially encountered. This instruc- 
tion prefetching scheme yields a 9% performance speedup 
on transaction processing benchmarks.

Data Prefetching. The PA-RISC instruction set includes a
class of instructions that modify the base value in a general
register by an immediate displacement or general register
index value. An example is LDWX,m r1(r2),r3. The LDWX (load
word indexed) instruction with a modify completer (,m)
loads the value at the address contained in register r2 into
register r3, and then adds r1 to r2 (i.e., load r2 -> r3; r1 + r2 ->
r2). The PA 7200 can use this class of instructions to specu-
late what data may soon be accessed by the code stream. If
the load from r2 in the above example is a cache miss, a prefetch is
issued to the address calculated by the base register modifi-
cation (r1 + r2). The PA 7200 uses this base register modifica-
tion to speculate where a future data reference will occur.
For example, if r1 contains 0x40 and r2 contains
0x100 and no lines are initially in the cache, then this in-
struction initiates a request for line 0x100 in response to the
cache miss and line 0x140 is prefetched. If the line 0x140 is
later used, some or all of the cache miss penalty is avoided.

When a line is prefetched, it is moved into the assist cache
and tagged as being a prefetched line. When a prefetched
line is later referenced by the code stream, another prefetch
is launched. Continuing with the above example, if this load
instruction were contained in a loop, on the first iteration of
the loop lines 0x100 and 0x140 would be requested from
memory. On the second iteration line 0x140 is referenced.
The assist cache detects this as the first reference to a pre-
fetched line and initiates a prefetch of line 0x180. This
allows memory requests to stay ahead of the reference
stream, reducing the stall cycles associated with memory
latency.
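
The chained prefetch behavior in this example can be traced with a short model. This is a sketch under stated assumptions: requests complete instantly, the cache never evicts, lines are 32 bytes, and the event names and run_loop function are illustrative rather than taken from the design.

```python
LINE_BYTES = 32


def line_of(addr):
    """Address of the cache line containing addr."""
    return addr & ~(LINE_BYTES - 1)


def run_loop(base, stride, iterations):
    """Trace a loop of base-register-modifying loads (r2 += stride).

    A miss requests the line and prefetches the line at r2 + stride.
    The first reference to a prefetched line launches the next prefetch."""
    cache = {}      # line address -> True while prefetched and unreferenced
    events = []
    r2 = base
    for _ in range(iterations):
        line, next_line = line_of(r2), line_of(r2 + stride)
        if line not in cache:
            events.append(("miss", line))
            launch = True
        else:
            launch = cache[line]      # True only on the first reference
        cache[line] = False
        if launch and next_line not in cache:
            events.append(("prefetch", next_line))
            cache[next_line] = True
        r2 += stride                  # base register modification
    return events
```

Calling run_loop(0x100, 0x40, 2) yields a miss on line 0x100, a prefetch of 0x140, and then a prefetch of 0x180 on the second iteration, which is the sequence described above.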

The PA 7200 allows four data prefetch requests to be out-
standing at one time. These prefetches can be used for
either prefetches along multiple data reference streams or
farther ahead on one data reference stream. Returning to
the vector example,

for i := 0 to N do
  A[i] := B[i] + C[i] * D[i]

each new cache line entered will cause four new prefetch
requests to be issued: one for each vector. On the other
hand, if the processor were doing a block copy:

for i := 0 to N do
  A[i] := B[i]

then it could prefetch two lines ahead of each reference
stream.

Reducing Average Memory Access Time

A number of features have been combined in the PA 7200 to
minimize the average memory access time (the average
number of cycles used for a memory reference). These
features together provide excellent performance speedups
on a number of applications that stress the memory hierar-
chy. Fig. 6 compares the performance of the PA 7200 and the
PA 7100 on a number of technical benchmarks. To minimize
the average memory access time associated with cache hits,
the large low-latency off-chip cache from the PA 7100 design
has been retained and enhancements made to allow single-
cycle stores. The PA 7200 improves on the PA 7100 by reduc-
ing compulsory, capacity, and conflict cache misses.

Fig. 6. [...] access time allow the PA 7200 CPU to outperform its predecessor,
the PA 7100, on technical benchmarks.

The PA 7200 reduces conflict misses by adding effective
associativity to entries of the main cache. This is done with-
out the overhead required for a large multiset associative
cache. Traditionally caches have been characterized as di-
rect mapped, multiset associative, or fully associative. The
PA 7200 assist cache effectively adds dynamically adjusted
associativity to main cache entries. As miss lines are
brought into the assist cache, the entries with the same
cache index mapping in the main cache are not immediately
replaced. This allows multiple cache lines with the same
index to reside in "the cache" at the same time. All assist
cache entries can be filled with lines that map to the same
off-chip cache index, or they can be filled with entries that
map to various indexes. This eliminates the disastrous
thrashing that can occur with a direct mapped cache, as
discussed earlier.

The PA 7200 reduces compulsory cache misses by prefetch-
ing lines that are likely to be used. When the software has
the information necessary at compile time to anticipate what
data is needed, the base register modification class of load
and store instructions can be used to direct prefetching. If
no specific direction is added to code or if old code is being
run, then base register modifying loads and stores can still
be used by the hardware to do effective prefetching. The
processor can also be configured to use loads and stores
that do not modify base registers to initiate speculative
requests. Because memory bandwidth is limited, care was
taken to minimize the amount of bad prefetching while max-
imizing the speedup realized by issuing memory requests
speculatively. Both old code traces and new compiler opti-
mizations were investigated to determine the best set of
prefetching rules.

In addition to the large caches supported by the PA 7200,
capacity misses are reduced by selectively allocating lines to
the off-chip cache only if they benefit from being moved
there. More effective use can be made of a given
cache capacity by only moving data that exhibits temporal
locality to the off-chip cache. The assist cache provides an
excellent location for use-once data. The spatial locality
only (.SL) hint associated with load and store instructions
allows code to identify which data is use-once (or simply too
large to be effectively cached), thereby reducing capacity
misses. The .SL hint is encoded in previously reserved load
and store instruction fields. Large analytic applications and
block move and clear routines achieve excellent speedups
from this new cache hint.

Bus Interface 

The PA 7200's Runway bus interface is carefully tuned to the
requirements and capabilities of the processor core. The
interface has several features that minimize transaction la-
tency, reduce processor cost, and take advantage of particu-
lar attributes of the CPU core to simplify interface design.
The bus interface contains a cache coherence queue and
transaction buffers, arbitration logic, and logic to support
multiple processor-to-bus-frequency ratios. The bus inter-
face also implements an efficient double snoop algorithm
for coherent transaction management.

The PA 7200 connects directly to the Runway bus without
transceivers or interface chips. Without this layer of exter-
nal logic, system cost is reduced while performance is in-
creased because of lower CPU-to-bus latency. Special sys-
tem and circuit designs allow the Runway bus to run at a
frequency of 120 MHz while maintaining connectivity to six
loads. Negative-hold-time receiver design and tight skew
control prevent races when drivers and receivers operate
from the same clock edge. A read transaction is issued in
one bus cycle and the 32-byte memory return is transferred
in four cycles, resulting in a peak sustainable bandwidth of
768 megabytes per second. To take advantage of the high
bus bandwidth, the PA 7200 can have up to six memory
reads in flight at the same time.
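
The quoted bandwidth follows from the cycle counts if one assumes that the address cycle and the four data cycles occupy the same bus back to back, so that each 32-byte line costs five bus cycles in total. That reading is an inference from the numbers rather than a statement in the text; the constant names are illustrative.

```python
BUS_HZ = 120_000_000      # Runway bus frequency
LINE_BYTES = 32           # bytes returned per read
ADDRESS_CYCLES = 1        # read transaction issue
DATA_CYCLES = 4           # 32 bytes over a 64-bit bus

cycles_per_line = ADDRESS_CYCLES + DATA_CYCLES
bytes_per_second = LINE_BYTES * BUS_HZ // cycles_per_line
megabytes_per_second = bytes_per_second // 1_000_000
```

Under this assumption megabytes_per_second evaluates to 768. Counting only the four data cycles would instead give 960, the raw transfer rate of a 64-bit bus at 120 MHz.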

To minimize read transaction latency, the PA 7200 asserts
and captures arbitration signals on the half cycle (phase), as 
shown in Fig. 7. The processor core communicates its intent 
to initiate a transaction in the first phase, allowing the inter- 
face to assert its bus arbitration signal on the second phase. 

A snoop, also known as a cache coherency check, is the action performed by all processors
and I/O adapters when they observe a coherent transaction issued by another module. Each
module performing the snoop must check its cache for the address of the current transaction
and, if found, change the state of that cache address. Cache state transitions are described in
the article on page 18.



The transaction address information, only available on the
third phase, is then forwarded from the processor core to
the bus interface. In the common case where there is no
contention for the Runway bus, the address is driven onto
the bus in the next cycle. Read and write buffers, included in
the bus interface to decouple the CPU core in case arbitration
is not immediately won, are bypassed in the common case
to reduce latency.

Transactions from the read and write buffers are issued by
the bus interface with fixed priorities. Snoop data has the
highest priority, followed by read requests, then the write of 
cache victims. When the memory controller cannot handle 
new read requests and the read and write buffers are full, 
the bus interface will issue the write transaction before the 
read to make best use of the bus bandwidth available. 
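
The issue order described above can be written as a small selection function. This is a sketch of one reading of the text: the parameter names are hypothetical, and the behavior when the memory controller is busy but the buffers are not full is not specified in the article, so the fixed priority is simply kept in that case.

```python
def next_transaction(snoop_data, read, write,
                     memory_accepts_reads=True, buffers_full=False):
    """Pick the next transaction type to issue, or None if nothing is pending.

    snoop_data, read, and write indicate whether each kind is pending."""
    if snoop_data:
        return "snoop_data"               # always the highest priority
    # Exception: issue the victim write ahead of the read so the bus
    # is not left idle while memory cannot take new reads.
    if read and write and not memory_accepts_reads and buffers_full:
        return "write"
    if read:
        return "read"
    if write:
        return "write"
    return None
```

With everything pending and memory accepting reads, the function returns "snoop_data"; with only a read and a write pending, it returns "read" unless the exception condition applies.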

Since transactions on the Runway bus are always accepted
(and never rejected or retried at the expense of bus band-
width), each processor acting as a third party must be able
to accept a burst of coherent transactions. Since there are
times when the CPU core is busy and cannot accept a snoop,
the bus interface implements a ten-transaction-deep queue
for cache snoops and a three-transaction-deep queue for
TLB snoops. With deep coherency queues, a large number of
coherent transactions from several processors can be out-
standing without the need to invoke flow control.

Processor-to-bus frequency ratios of 1:1, 3:2, and 4:3 are
provided for higher-frequency processor upgrades. Using a
ratio algorithm that requires the bus clock to be synchro- 
nous with the processor clock ensures that the ratio logic 
does not impart synchronization delays typical of systems 
with asynchronous clock domains. For any ratio, the worst- 
case delay is less than one CPU clock cycle, and in the best 
case, data transmission does not incur any delay. 

To minimize processor pipeline stalls resulting from multi-
processor interference, transactions at the head of the co-
herency queue are forwarded to the CPU core in two steps.
First, the core is sent a lightweight query, which steals one
cycle of cache bandwidth. A low-latency response is received
from the off-chip and assist caches. Only when a cache state
modification is required is a second full-service query for-
warded to the CPU core. Since the vast majority of cache
snoops result in misses, this double snoop approach allows
the PA 7200 to achieve higher multiprocessor performance
without the added cost and complexity of a dual-ported
cache or duplicate cache tags.
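
The two-step scheme can be sketched as a counter of the queries each snoop generates. The model below makes a simplifying assumption that every snoop hit requires a state change and therefore a full-service query; the article does not give the cost of the full query, so only counts are returned.

```python
def double_snoop(snooped_lines, cached_lines):
    """Return (lightweight_queries, full_service_queries) for a stream
    of snooped line addresses against the set of locally cached lines."""
    lightweight = 0
    full_service = 0
    for line in snooped_lines:
        lightweight += 1              # one stolen cache cycle per snoop
        if line in cached_lines:      # assumed: a hit needs a state change
            full_service += 1
    return lightweight, full_service
```

For a stream of 100 snoops of which 3 hit, the core sees 100 single-cycle queries and only 3 full-service ones, which is the effect the double snoop relies on.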

PA 7200 Circuit Translation 

Most of the PA 7200 circuit designs, artwork, and physical 
design methodology are based upon and leveraged from the 
PA 7100 CPU, which was designed using HP's CMOS26 IC 
process, tools, and libraries. However, aggressive perfor- 
mance and cost goals required that the PA 7200 be fabricated 
using the faster, denser CMOS14 IC process also under 
development. To completely redesign and lay out existing 
PA 7100 circuits for the CMOS14 process would have been 
an inefficient use of resources and would have greatly ex-
tended the design phase. Therefore, the entire PA 7200 was
designed using the existing CMOS26 technology, and the
artwork was then automatically translated to and reverified 
in the CMOS14 process. 

Unfortunately, automatic translation faced two global issues. 
First, CMOS26 is a 5.0V (nominal) process but CMOS14 was
originally specified for 4.0V operation. Simulations showed
that the speed of a few common circuit topologies did not
scale linearly into the target technology because of the lower 
supply voltage. Detailed investigation by the CMOS14 devel- 
opment group concluded that raising the supply voltage by 
10% was feasible and the process was fully qualified for 
operation at 4.4V. This was sufficient for these circuits to 
meet the speed improvement goal. 

Second, CMOS26 layout rules do not scale uniformly into
the respective rules for CMOS14, since each component of a
process technology has different physical and manufacturing
constraints. A simple gate-shrink algorithm, which only re-
duces FET effective gate length, could have provided a 20%
transistor speed improvement. Without overall area reduc-
tion, the extra PA 7200 functionality dictates a die size much
larger than the PA 7100 and this approach would result in
slower wire speeds and a sharp increase in manufacturing
cost. With aggressive scaling, a more complex translation
algorithm, and a limited number of engineering adjustments
to the layout and electrical rules, the CMOS14 version
achieves a 20% overall speed improvement along with a 38%
power reduction from the original CMOS26 design.

Translation Methodology. The methodology that was devel-
oped accommodates CMOS26 designs and translated
CMOS14 artwork in parallel, is generally transparent, and
merges smoothly with the existing design environment. A
hierarchical (block-level) translation methodology was cho-
sen because it provides many advantages over the more
traditional flat (mask-level) translation. Important reasons
for selecting this approach were:
• Algorithm flexibility. The optimal translation algorithm is
not required to guarantee that every pathological CMOS26
layout (and, more important, every existing PA 7100 block) is
translated to a legal CMOS14 layout, as long as a manage-
able number of violations result and are easily correctable
by hand. Hierarchical methods imply editing only unique
instances of a violation at the block level, rather than the
entire set on a flattened mask.



• Design modularity. Having parallel hierarchies containing
both CMOS26 and CMOS14 blocks enables additional flexi-
bility. Translated artwork can be read directly by the front-
end editors for electrical simulation and other purposes. On
the top-level routing blocks, CMOS14 layouts using a tighter
metal pitch were a necessary alternative to the translated
CMOS26 versions.

• Concurrent methodology. Translated artwork is available 
for mask generation along with the original block. Flat 
translation is serialized and for complex algorithms implies 
a costly delay after each design release. Moreover, having a 
complete, hierarchical CMOS14 artwork database allowed
subsequent chip revisions to be released using incremental
changes made directly to the CMOS14 artwork.

Many operations in the translation algorithm are compli-
cated by hierarchical junctions (these would disappear with
a flat translation). A hierarchical junction is any connection
between objects in separate blocks. If individual artwork 
features touching or extending beyond hierarchical bound- 
aries are further shrunk by a fixed distance after being re- 
duced by the scaling coefficient, gaps will occur at the par- 
ent junctions that cannot always be filled automatically. 
A subtle but more troublesome scaling problem is caused by 
snapping the location of child instances to the grid resolu- 
tion, which creates shape misalignments or gaps at parent- 
child or child-child junctions if origins round in a different 
direction. This effect can be cumulative, and becomes signif- 
icant for junctions that span multiple hierarchical levels. 
Increased database size and consistency checking are other 
drawbacks of a block-oriented translation. 
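
The origin-snapping problem can be shown with integer arithmetic. This is an illustrative sketch only: it assumes coordinates in nanometers, a scale factor of 0.8, round-to-nearest snapping onto a 50-nm grid, and that child origins and shape offsets are snapped separately. The actual translation tool's rounding rules are not given in the article.

```python
GRID_NM = 50


def snap(nm):
    """Round a coordinate to the nearest grid point (halves round up)."""
    return (nm + GRID_NM // 2) // GRID_NM * GRID_NM


def scale(nm):
    """Scale by 0.8; exact for coordinates that are multiples of 5 nm."""
    return nm * 4 // 5


def translated_edge(child_origin, edge_offset):
    """Edge position after hierarchical translation: the child origin and
    the edge offset inside the child are scaled and snapped independently."""
    return snap(scale(child_origin)) + snap(scale(edge_offset))


# Two children whose edges meet at x = 300 nm before translation.
edge_in_child_a = translated_edge(0, 300)     # origin 0, edge at +300
edge_in_child_b = translated_edge(150, 150)   # origin 150, edge at +150
```

Here edge_in_child_a lands at 250 nm and edge_in_child_b at 200 nm, so two edges that coincided before translation end up one grid step apart because their components rounded in different directions.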

A final check was added after CMOS14 layout verification to 
hierarchically compare ports, signals, and connectivity be-
tween the CMOS26 and CMOS14 artwork netlists. This was
necessary since hand corrections made to the translated 
CMOS14 layout could introduce new design errors. 

Translation Algorithm. Any scaling coefficient should ensure
that all minimum widths, spaces, and exact-size shapes from
CMOS26 be translated to CMOS14 such that each edge pair
snaps to the grid resolution (0.05 µm) in the same direction.
There are several natural solutions to ensure that 1.0-µm
(drawn) minimum features in CMOS26 always become
0.6-µm minimum features in CMOS14. For example:

• Scale by α = 0.8 and then further shrink interconnect by
0.2 µm.

• First shrink interconnect by 0.2 µm and then scale by
α = 0.75.

The second option is only practical for library blocks since it 
is too aggressive for interconnect with minimum contacted 
pitch and provides less margin for the effects of uneven grid 
snapping. The detailed algorithm is based upon the first op- 
tion, with additional manipulations of n-well regions, FET 
gate extensions, contact sizes, interconnect contact enclo- 
sure, and interlayer contact spacing. These operations have 
parasitic effects which can create notches and narrow cor- 
ners and are usually correctable by automatically filling new 
width and spacing violations. 
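
The two scaling options can be compared numerically. The sketch below uses exact fractions in micrometers; the function names are illustrative, and only the scale factors, the 0.2-µm shrink, and the 0.05-µm grid are taken from the text.

```python
from fractions import Fraction

GRID_UM = Fraction(5, 100)       # 0.05-um grid resolution
SHRINK_UM = Fraction(2, 10)      # 0.2-um interconnect shrink


def option_one(width_um):
    """Scale by 0.8, then shrink by 0.2 um."""
    return width_um * Fraction(8, 10) - SHRINK_UM


def option_two(width_um):
    """Shrink by 0.2 um, then scale by 0.75."""
    return (width_um - SHRINK_UM) * Fraction(3, 4)


def on_grid(width_um):
    """True if the width is an exact multiple of the grid resolution."""
    return (width_um / GRID_UM).denominator == 1


minimum = Fraction(1)            # 1.0-um drawn minimum feature
wide = Fraction(2)               # a hypothetical 2.0-um wire
```

Both options map the 1.0-µm minimum to 0.6 µm, but for the 2.0-µm wire option_one gives 1.4 µm and option_two gives 1.35 µm, which illustrates why the second option is the more aggressive of the two.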






There were still a residual number of geometrical cases that 
could not be fully translated by any reasonable tool or heu- 
ristic. In these cases we either waived the layout rules 
where margin was available or made extra efforts to repair 
rule violations by hand. Although many of these violations 
did occur, the vast majority resulted either from the hierar- 
chical phenomena described earlier or from fundamental 
scaling issues with certain contact structures and latch-up 
prevention rules. In no case was any significant block relay- 
out required, however. 

Scaling-Sensitive Circuits. Although algorithmic translation of 
PA 7200 circuits generally improves electrical performance 
and decreases parasitic effects, there are a few exceptional 
circuits with different characteristics. In general, these were 
abnormally sensitive to transistor sizing ratios, noise caused 
by coupling, voltage shifts caused by charge sharing, small 
variations in processing parameters, or the reduced 4.4V 
high level. Additionally, total resistance in the third layer of 
metal can increase after translation and cause routing delays 
to improve less than the basic scaling assumptions predict.

Summary 

The design goal for the PA 7200 was to increase the perfor- 
mance of Hewlett-Packard computer systems on real-world 
applications in a variety of markets while maintaining a high 
degree of price/performance scalability and a low system 
component count. General application performance is im- 
proved through an increase in operating frequency, a second 
integer ALU for enhanced superscalar execution, and im- 
proved store instruction performance. For applications that 
operate on large data sets, such as typical analytic and 
scientific applications, the hardware prefetching algorithms 
and fully associative assist cache implemented in the 
PA 7200 provide excellent performance increases. In addi-
tion, the processor includes a high-bandwidth, low-latency
multiprocessor bus interface to support cost-effective, high-
performance, one-way to four-way multiprocessor systems,
which are ideal for technical or commercial platforms, with- 
out additional interface chips. Additionally, the PA 7200 is 
scalable from desktop workstations to many-way multipro-
cessor corporate computing platforms and supercomputers.

Verification, Characterization, and
Debugging of the HP PA 7200 
Processor 

To guarantee a high-quality product the HP PA 7200 CPU chip was 
subjected to functional and electrical verification. This article describes 
the testing methods, the debugging tools and approaches, and the impact 
of the interactions between the chip design and the IC fabrication 
process. 

by Thomas B. Alexander, Kent A. Dickey, David N. Goldberg, Ross V. La Fetra, James R. McGee, 
Nazeem Noordeen, and Akshya Prakash 



The complexity of digital VLSI chips has grown dramatically 
in recent years. Rapid advances in integrated circuit process 
technology have led to ever-increasing densities, which have
enabled designers to design more and more functionality into
a single chip. Electrically, the operating frequency of these
VLSI chips has also gone up significantly. This has been a
result of the increased speed of the transistors (CMOS tran- 
sistors are commonly called FETs, for field effect transistors) 
and the fact that the circuits are closer to each other than 
before. All this has had tremendous benefits in terms of per- 
formance, size, and reliability. 

The increased complexity of the VLSI chips has created new 
and more complicated problems. Many sophisticated tech-
niques and tools are being developed to deal with this new
set of problems. Nowhere is this better illustrated than with 
CPUs, especially in design verification, both functional and 
electrical. While design has always been the focus of atten- 
tion, verification has now become a very challenging and 
critical task. In fact, verification activities now consume 
more time and resources than design and are the real limit-
ers of time to market.

On the functional side, for many years now it has been im- 
possible to come even close to a complete check of all pos- 
sible states of the chip. The challenge is to do intelligent 
verification (both presilicon and postsilicon) that gives very
high confidence that the design is correct and that the final 
customer will not see any problem. On the electrical side, 
the challenge has been to find the weak links in the design 
by creating the right set of environments and tests that are 
most likely to expose failures. The increased complexity of 
the VLSI chips has also made isolation of a failure down to 
the exact FET or signal an increasingly difficult task. 

This paper presents the verification methodology, tech-
niques, and tools that were used on the HP PA 7200 CPU to
guarantee a high-quality product. Fig. 1 shows the PA 7200
CPU in its pin-grid array package. The paper describes the
functional and electrical verification of the PA 7200 as well 
as the testing methods, the debugging tools and approaches,
and the impact of the interactions between the chip design
and the IC fabrication process. 

Functional Verification 

The PA 7200 CPU underwent intensive design verification
efforts to ensure the quality and correctness of its function- 
ality. These verification efforts were an integral part of the 
CPU design process. Verification was performed at all stages
in the design, and each stage imposed its own constraints on 
the testing possible. There were two main stages of func- 
tional verification: the presilicon implementation stage and 
the postsilicon prototyping stage. 

Presilicon Functional Verification 

Since the design of the PA 7200 was based upon the PA 7100 
CPU, we chose to use the same modeling language and pro-
prietary simulator to model and verify its design. During the
implementation stage a detailed simulation model was built 




Fig. 1. The PA 7200 CPU in its pin-grid array package, with the lid 
removed to reveal the chip. 
to verify the correctness of the design. Early in the imple-
mentation stage, software behavioral models were used to
represent portions of the design and these were incremen- 
tally replaced by detailed models. A switch-level model was 
also used late in the implementation stage to ensure equiva- 
lence between the actual design implementation and the sim- 
ulation model. This switch-level model was extracted from 
the physical design's FET artwork netlists and was used in
the final regression testing of the design.

Test cases were written to provide thorough functional cov- 
erage of the simulation model. The test case strategy for the 
PA 7200 was to: 

• Run all existing cases derived for previous generations of
PA-RISC processors 

• Run all architectural verification programs (AVPs) 

• Write and run test suites and cases directed at specific func-
tional areas of the implementation, including the newly
designed multiprocessor bus interface (Runway bus) and its
control unit, the assist cache, the dual-issue control unit, 
and other unique functionality 

• Generate and run focused random test cases that thoroughly 
stress and vary processor state, cache state, multiprocessor 
activities, and timing conditions in the various functional 
units of the processor. 

Existing legacy test cases and AVPs targeted for other gen- 
erations of PA-RISC processors often had to be converted or 
redirected in a sensible way to yield interesting cases on the 
PA 7200. Additional test cases were generated to create 
complex interactions between the CPU functional units,
external bus events, and the system state. An internally de- 
veloped automated test case generation program allowed 
verification engineers to generate thousands of interesting 
cases that focused upon and stressed particular CPU units 
or functions over a variety of normal, unusual, and boundary 
conditions. In addition, many specific cases were generated 
by hand to achieve exact timing and logical conditions.
Macros were written and a macro preprocessor was used to
facilitate high productivity in generating test case conditions.

All test code was run on the PA 7200 CPU model and on a
PA-RISC architectural simulator and the results were com-
pared on an instruction-by-instruction basis. The test case
generation and simulation process is shown in Fig. 2. A PA 
7200-specific version of the PA-RISC architectural simulator 
was developed to provide high coverage in the areas of multi- 
processor-specific conditions, ordering rules, cache 
move-in, move-out rules, and cache coherence. Some por- 
tions of the internal CPU control model were also compared 
with the architectural simulator to allow proper tracking 
and checking of implementation-specific actions. Since the 
PA 7200 was designed to support several processor-to-
system-bus frequency ratios, the simulation environment
was built to facilitate running tests at various ratios.

The architected state of the CPU and simulator, including 
architected registers, caches, and TLBs, was initialized at 
model startup time. Traces of instruction execution and rele-
vant architected state from the CPU model and from the
PA-RISC simulator were compared. These traces included 
disassembled code, affected register values, and relevant 
load/store or address information, providing an effective 
guide for debugging problems. 
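
The instruction-by-instruction comparison described above can be illustrated with a minimal sketch. The article does not show the actual trace format or the proprietary comparison tooling, so the `TraceEntry` layout and the `first_mismatch` helper below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One executed instruction as logged by a model (hypothetical layout)."""
    pc: int                                        # instruction address
    disasm: str                                    # disassembled instruction text
    reg_writes: Tuple[Tuple[str, int], ...] = ()   # (register, new value) pairs


def first_mismatch(
    model_trace: Iterable[TraceEntry],
    reference_trace: Iterable[TraceEntry],
) -> Optional[Tuple[int, Optional[TraceEntry], Optional[TraceEntry]]]:
    """Return (index, model_entry, reference_entry) at the first divergence.

    Returns None when both traces are identical. If one trace ends early,
    its missing entry is reported as None.
    """
    pairs = zip_longest(model_trace, reference_trace)
    for index, (model_entry, reference_entry) in enumerate(pairs):
        if model_entry != reference_entry:
            return index, model_entry, reference_entry
    return None
```

Reporting only the first divergence keeps the debugging focus on the instruction where the two models first disagree, since later differences are usually consequences of it.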
Fig. 2. PA 7200 test case generation and simulation process.

A test bench approach was used to model other system bus 
components and to verify proper system and multiprocessor 
behavior, including adherence to the bus protocol. The test
bench accepted test case stimulus to stimulate and check
proper CPU operation. Multiprocessor effects on the caches
and the pipeline of the CPU being tested were checked in 
detail both by instruction execution comparison and by final 
state comparison of architected registers, caches, and TLBs. 

The bulk of the testing during the implementation stage en- 
tailed running assembly language test vectors on the simula- 
tion model. The principal limitation of this stage was the 
limited execution speed of the simulation model. 

As components of the simulation model became defined and 
individually tested, they were combined into increasingly
larger components until a combined simulation model was
built for the entire computer system including processors,
memory, and I/O.

An effort was also made to evaluate the test case coverage
of processor control logic to ensure that we had high cover-
age of the functional units with normal and corner-case con-
ditions. During our regressions of functional simulation, the
simulation model was instrumented to provide coverage
data, which was postprocessed to yield coverage metrics on
the control logic. 
This verification effort consumed many engineering months
in test bench development, test case generation, and test 
checking. Billions of simulation cycles were run on scores 
of high-performance technical workstations during nights 
and weekends in several different geographical locations. 
The result of this effort is a high-quality CPU that booted its
operating system and enabled postsilicon functional and
electrical verification efforts soon after the first silicon parts
were received. 

This simulation approach also facilitated a productive de- 
bugging and regression testing environment for testing fixes. 
Specifically, when making a correction to the CPU, the simu-
lation environment allowed the verification team to run re-
gression suites that stressed the faulty area, providing more
complete simulation coverage of the problem. 

Postsilicon Functional Verification 

Despite the massive presilicon testing, there are always 
bugs that are found once the first hardware becomes avail- 
able. Bugs that affect all chips regardless of temperature, 
voltage, or frequency are termed functional bugs. Bugs that 
are made worse by environmental conditions or do not 
occur in all chips are termed electrical problems, and the 
test strategy for finding those problems is detailed in the 
section on electrical verification. 

Testing machines as complex as the HP 9000 J-class and
K-class systems, the first systems to use the PA 7200, was a
large effort involving dozens of people testing specific areas. 
This section will describe how the processor design labora- 
tory created processor-focused tests to find processor bugs. 
Many other people contributed to testing other portions of 
the systems with intentional overlap of testing coverage to 
ensure high quality. 

Because of the complexity of modern processor chips, not 
all bugs are found in presilicon testing. The processor is so 
complicated that adequate testing would take years running 
at operational speed to hit all the interesting test cases. Pre- 
silicon testing is orders of magnitude too slow to hit that 
many cases in a reasonable amount of time. Thus, when 
presilicon testing stops finding bugs, the chip is manufac- 
tured and postsilicon testing commences. However, finding 
bugs is not as simple as just turning the power on and wait-
ing for the bugs to appear. One problem is deciding what
tests to run to look for failures. Poorly written tests will not
find bugs that customers might find. Another problem is 
debugging failures to their root causes in a timely manner. 
Knowing a problem exists is a great start, but sometimes
discovering exactly what has gone wrong in such complex 
systems can be very difficult. Postsilicon testing loses much 
of the observability of processor state that was easily
obtained in the simulation environment.

To provide high coverage of design features, three testing
tools were prepared to stress the hardware. These tools 
were software programs used to create tests to run on the 
prototype machines. Each tool had its own focus and in-
tended overlap with other tools to improve coverage. All
tools had a proven track record from running on previous 
systems successfully. To ensure adequate testing, two tools 
were heavily modified to stress new features in the PA 7200. 



All of the tools had some features in common. They all ran 
standalone independently on prototype machines under a 
small operating system. Because they did not run under the 
HP-UX* operating system, much better machine control 
could be achieved. In addition, not needing the HP-UX oper-
ating system decoupled hardware debugging from the soft-
ware schedule and let the hardware laboratory find bugs in 
a timely manner. (Later in the verification process, HP-UX-
based system testing is performed to ensure thorough cover-
age. However, the team did not rely on this to find hardware
problems.) All hardware test tools also had the ability to
generate their own code sequences and were all self-check- 
ing. Often these code sequences were randomly generated, 
but some tools supported hand-coded tests to stress a par- 
ticular hardware feature. 

Uniprocessor Testing 

Even though PA 7200 systems support up to four processors, 
it is desirable to debug any uniprocessor problems before 
testing for the much more complex multiprocessor bugs. 
The first tool was leveraged from the PA 7100LC effort to 
provide known good coverage of uniprocessor functionality. 

This tool operated by generating sequences of pseudorandom
instructions on a known good machine, like an HP 9000 
Model 735 workstation. On this reference machine, a simula-
tor would calculate the correct expected values and then
create a test to be run on the prototype hardware. This test
would set up hundreds of various initial states and run the 
prepared sequence. Each time it ran the sequence, the tool 
would determine if it got the correct result and display any 
differences. Since much of the work was done on another 
machine to prepare the correct answer, this tool was very 
robust and was a good initial bring-up vehicle. It also could 
run its sequences very quickly and give good coverage in a 
short amount of time. However, uniprocessor bugs ramped
down very quickly, and so this tool was used much less after 
initial bring-up. 
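
As a rough illustration of this flow, the sketch below generates a pseudorandom sequence, precomputes expected results for several initial states on a reference model, and then checks a device against them. The three-operation toy instruction set, the register count, and all function names are hypothetical stand-ins; the real tool generated PA-RISC code and is not shown in the article.

```python
import random

# Toy 32-bit register-to-register operations standing in for real instructions.
OPS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
    "xor": lambda a, b: a ^ b,
}
NUM_REGS = 4


def generate_sequence(rng, length):
    """Build a pseudorandom list of (op, dst, src1, src2) tuples."""
    return [
        (rng.choice(sorted(OPS)), rng.randrange(NUM_REGS),
         rng.randrange(NUM_REGS), rng.randrange(NUM_REGS))
        for _ in range(length)
    ]


def reference_run(sequence, initial_regs):
    """Reference simulator: compute the expected final register values."""
    regs = list(initial_regs)
    for op, dst, src1, src2 in sequence:
        regs[dst] = OPS[op](regs[src1], regs[src2])
    return regs


def build_test(rng, length, num_states):
    """Prepare one sequence plus (initial state, expected result) pairs."""
    sequence = generate_sequence(rng, length)
    cases = []
    for _ in range(num_states):
        initial = [rng.randrange(2**32) for _ in range(NUM_REGS)]
        cases.append((initial, reference_run(sequence, initial)))
    return sequence, cases


def run_test(device, sequence, cases):
    """Run every prepared case on the device; return indices of failing cases."""
    return [
        index for index, (initial, expected) in enumerate(cases)
        if device(sequence, initial) != expected
    ]
```

Because the expected values are computed before the test reaches the prototype, the checking step on the device under test is a plain comparison, which is consistent with the robustness the text attributes to this tool.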

Multiprocessor Testing 

The verification team was especially concerned with multi-
processor bugs, since experience indicated that they are
much more difficult to find and debug than uniprocessor 
cases. These complex bugs were often found later in the 
project. For this reason, the two other tools used were 
heavily modified to enhance PA 7200 testing for multipro-
cessor corner cases.

The first multiprocessor-focused tool attempted to do ex- 
haustive testing of the effects of various bus transactions 
interacting with a test sequence. The interference transac- 
tions were fixed but were chosen to hit all the cases that 
were considered interesting. The test sequence could be 
randomly generated or written manually to stress a particu- 
lar area of the processor. 

To determine if a test operated properly, the tool would run 
the test sequence once without any interference from other 
processors. It would capture the machine state after this run 
(register, cache, memory) and save it as the reference re-
sults. The tool did not need to know what the test was doing
at all — it simply logged whatever result it got at the end. To 
test multiprocessor interference transactions, the tool
would then arrange to have other processors try all combi- 
nations of interesting transactions selected to interact with 
the sequence. This was accomplished by running the test in 
a loop with the interference transactions being moved 
through every possible cycle in a predetermined timing win- 
dow. This exhaustive testing of interference transactions 
against a code sequence provided known good coverage of 
certain troublesome areas. When there were failures, many 
useful debugging clues were available regarding which 
cases passed and which cases failed to help in tracking 
down the bug. 
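
The sweep described above can be sketched as follows, assuming (as the reference-run approach implies) that interference from other processors must not change the final state of the test sequence. The `run_sequence` callback, the transaction names, and the window size are hypothetical; the article does not list the actual transactions or window used.

```python
from itertools import product


def sweep_interference(run_sequence, transactions, window):
    """Compare every (transaction, cycle) placement against a clean run.

    run_sequence(interference) must return the final machine state.
    interference is None for the reference run, otherwise a
    (transaction, cycle) tuple naming what to inject and when.
    """
    reference = run_sequence(None)  # run once with no interference
    failures = []
    for transaction, cycle in product(transactions, range(window)):
        if run_sequence((transaction, cycle)) != reference:
            failures.append((transaction, cycle))
    return reference, failures
```

The list of failing placements, read together with the placements that passed, is the kind of debugging clue the text mentions: it shows which transaction and which cycle offsets expose the problem.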

The main deficiency of this tool was that it relied on high 
uniprocessor functionality. If there were bugs that could 
affect the reference run, the tool would not be able to detect
them. Thus, this tool could not run until uniprocessor func-
tionality was considered stable. As it turned out, the initial
PA 7200 silicon had very high uniprocessor functionality and
so testing began on initial silicon. One advantage of this tool
over the uniprocessor tool was that it could run for an un-
limited amount of time on the prototype hardware and gen-
erate test cases on its own. This ability made running this
tool much simpler than the uniprocessor tool. 

The final tool was the backbone of the functional verifica- 
tion effort. In many ways, this tool merged the good ideas 
from other tools into one program to provide high coverage 
with ease of use. This tool generated sequences of pseudo- 
random instructions on each processor, ran the sequences 
across all the processors simultaneously, and then checked 
itself. It calculated the correct results as it created the in-
struction sequences, so it could run unattended for an un-
limited time. The sequences and interactions it could gener-
ate were much more stressful than those of the other tools. Much of
this ability came from the effort put into expanding this tool, 
but some of it came from basic design decisions. 

This tool relied completely on pseudorandomly generated 
code sequences to find its bugs. The tool took probabilities
from an input file which directed the tool to stress certain 
areas. These focused tests enhanced coverage of certain 
processor functionality such as the new data prefetching 
ability of the PA 7200. Almost any parameter that could be 
changed was changed constantly by this tool to hit cases 
beyond what the verification team could think of. Having 
almost no fixed parameters allowed this tool to hit bugs that 
no other tool or test has ever hit. 
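
A minimal sketch of probability-directed generation is shown below. The instruction classes and weights are invented for illustration, and the weights are passed as a dictionary rather than read from an input file; the real tool's input format is not described in the article.

```python
import random


def make_generator(weights, seed):
    """Return a function that draws instruction classes by relative weight.

    weights maps a class name to a nonnegative relative weight, in the way
    a probability input file might direct stress toward one functional area.
    """
    rng = random.Random(seed)        # seeded so that a failing run can be replayed
    kinds = list(weights)
    relative = [weights[kind] for kind in kinds]

    def next_kinds(count):
        return rng.choices(kinds, weights=relative, k=count)

    return next_kinds


# Hypothetical profile that leans on loads and prefetch-related cases.
prefetch_profile = {"load": 6, "store": 3, "prefetch_hint": 1, "branch": 0}
```

Seeding the generator is a design choice of this sketch: a pseudorandom test is only useful for debugging if the failing sequence can be regenerated exactly.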

This final tool received additional modifications to test DMA 
(direct memory access) between peripheral cards and mem- 
ory. The new Runway bus added new bus protocols involving 
I/O transactions, which the processor needed to obey to 
ensure system correctness. DMA was used to activate these 
bus protocols to verify that the PA 7200 operated properly. 
To make sure these extra cases were well covered, DMA
was performed using various peripheral devices while the 
processor testing was done. This extra testing was worth 
the investment since several bugs were found that might not
have been caught otherwise. 

The postsilicon verification effort was considered successful
because the team found almost every bug before other 
groups and could communicate workarounds for hardware 
problems to keep them from affecting software schedules.



The operating system testing actually found very few pro- 
cessor bugs, and all serious bugs were found by the postsili- 
con hardware verification team. Some of the later hardware 
bugs found may never be encountered by the current operat-
ing system because the compilers and the operating system
are limited in the code sequences they emit. However, the 
hardware has been verified to the point that if a future com-
piler or operating system uses a feature not used before, it
can in all likelihood do so without encountering a bug. 

Electrical Verification 

Electrical verification of a VLSI device is performed to guar- 
antee that when the product is shipped to a customer, the 
device will function properly over the entire operating region 
specified for the product. The operating region variables 
include ambient temperature, power supply voltages, and 
the clock frequency of the VLSI device. In addition, electri- 
cal verification must account for integrated circuit fabrica- 
tion process variation over the life of the device. Testing for 
sensitivities to these variables and improving the design to 
account for them improves fabrication yield and increases 
the margin of the product. This section describes the vari- 
ous electrical verification activities performed for the PA 
7200 CPU chip. 

Electrical Characterization 

Electrical characterization refers to the task of creating dif- 
ferent test environments and test code with the goal of iden- 
tifying electrical failures on the chip. Once an electrical fail- 
ure is detected, characterization also includes determining 
the characteristics of the failure like dependencies on volt- 
age, temperature, and frequency. 

Electrical failures may manifest themselves on one, several, 
or every chip at some operating point (temperature, voltage, 
or frequency) of the CPU. Electrical failures cause the chip
to malfunction and typically have a root cause in some elec-
trical phenomenon such as the familiar hold time or setup
time violation. As chip operating frequencies increase, other
electrical phenomena such as coupling between signals, 
charge sharing, and unforeseen interchip circuit interactions 
increasingly become issues. 

To ensure a high level of quality, various types of testing and 
test environments are used to check that all electrical fail-
ures are detected and corrected before shipment to custom-
ers. Dwell testing and shmoo testing are two types of testing
techniques used to characterize chips. 

For the PA 7200, dwell testing involved running pseudo-
random code on the system for extended periods of time at
a given voltage-temperature-frequency point. Since the test
code patterns are extremely important for electrical verifica-
tion, dwell testing was used to guarantee that the pseudo-
random code would generate sufficient patterns to test the
CPU adequately.

Shmoo testing involves creating voltage-frequency plots 
(shmoo plots) by running test code at many voltage- 
frequency-temperature combinations. Fig. 3 shows a typical 
style of shmoo plot. This plot is for a failing chip that has
some speed problems. By examining the shape of the shmoo 
Fig. 3. Voltage-frequency shmoo plot.

plot, much can be learned about the design of the chip. Volt-
age-frequency-temperature points well beyond the legal oper-
ating range should be included in the shmoo plot. It is not
sufficient to rely only on the minimum allowed margin (in
terms of voltage-frequency-temperature) to determine if the
design is robust. The test code run for creating shmoo plots
is extremely important. Simple code can create a false sense 
of quality. 
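
To make the idea of a shmoo plot concrete, the sketch below renders a text grid of pass and fail points over voltage and frequency at a single temperature. The `passes` callback stands in for running the test code at one operating point, and the `toy_chip` rule is a made-up pass/fail model, not PA 7200 data.

```python
from typing import Callable, Sequence


def shmoo_plot(passes: Callable[[float, float], bool],
               voltages: Sequence[float],
               frequencies: Sequence[float]) -> str:
    """Render one row per voltage (highest first), one column per frequency.

    '*' marks a passing point and '.' a failing point.
    """
    rows = []
    for volts in sorted(voltages, reverse=True):
        cells = "".join("*" if passes(volts, mhz) else "."
                        for mhz in sorted(frequencies))
        rows.append(f"{volts:4.1f} V |{cells}")
    return "\n".join(rows)


def toy_chip(volts: float, mhz: float) -> bool:
    """Made-up rule: the maximum passing frequency rises linearly with voltage."""
    return mhz <= 30.0 * volts


print(shmoo_plot(toy_chip, [4.0, 4.5, 5.0], [100, 115, 130, 145, 160]))
```

A real characterization would repeat such a grid at several temperatures and extend it well past the legal operating range, as the text recommends.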

Testing Environments 

There were four main testing environments for the PA 7200: 
system characterization, chip tester characterization, pro- 
duction characterization, and functional characterization. 

System Characterization. This testing is focused on running 
the CPU in the actual system and altering the operating vari- 
ables to determine the characteristics of the design. The 
variables that are involved here are test code, ambient tem- 
perature, voltages (internal chip voltage, I/O pad voltage, 
and cache SRAM voltage), frequency of the chip, types of 
CPU chips (variations in manufacturing process), types of 
cache SRAMs (slow versus fast), and system bus speed.
Various types of test code are run on the system, including 
pseudorandom PA-RISC code, HP-UX application code, and 
directed PA-RISC assembly code. 

Chip Tester Characterization. This testing consists of running a
set of chips processed with different manufacturing process 
variables on a VLSI chip tester over ranges of temperature, 
voltage, and frequency, using a set of specific tests written 
for the PA 7200. The chip tester can run any piece of code at 
operating frequency by providing stimulus and performing 
checks at the I/O pins of the chip. Testing is accomplished 
through a mixture of parallel and scan methods using a VLSI 
test system. The majority of testing is done with at-speed 
parallel pin tests. Tests written in PA-RISC assembly code 
for the PA 7200 that cover logical functionality and speed 
paths are converted through a simulation extraction process
into tester vectors. Scan-based tests are used for circuits 
such as standard-cell control blocks and PLA structures,
which are inherently difficult to test fully using parallel pin
tests. These parallel tests are run on the tester well beyond 
the operating speed of the chip. 

Production Characterization. All PA 7200 chips go through a 
set of tests on the chip tester. Since a large number of chips
are manufactured for prototyping purposes, the results of 
the normal manufacturing tests are very valuable for charac-
terization. This testing provides characterization data for all
the chips that are manufactured with a set of specific tests 
written for the PA 7200 over ranges of temperature, voltage, 
and frequency. Parallel and scan tests written for the PA 
7200 are run within the operating range possible on cus- 
tomer systems as well as in well-defined regions of margin 
outside this operating range. This type of testing over all the
chips shows electrical failures that could happen if there are 
variations in the manufacturing process over time. 

Functional Characterization. This testing involves running 
pseudorandomly generated tests on the system at the nomi- 
nal operating point for very long periods of time (months). 
Even though this testing uses code environments targeted 
for functional verification, it can be very effective in detect- 
ing electrical issues. This type of testing can often find any 
test cases (circuit paths) that have not been covered in the 
prior three types of testing and will reduce the chance that 
the customer will ever have any electrical problems. 

Debugging 

When a problem is seen within the operating region of the
chip, the problem must be debugged and fixed. Tests are run
well beyond the operating region to look for anomalies.
Failures outside the operating region are also understood to 
make sure that the failure will not move into the operating 
region (with a different environment, test, or manufacturing
process shift). The root causes of these electrical problems
need to be characterized and understood. In the character- 
ization of the problem many chips are tried in various envi- 
ronments to understand the severity of the problem. To un-
derstand the cause of the failure, the test code is analyzed
and converted to a small directed test with only the pertinent
failing sequence. This is necessary to limit the scope of the
investigation. Then the problem is further analyzed on the 
chip tester. The chip tester can run any piece of code at 
speed but it can run only reasonable sizes of code because 
of the amount of tester memory. The tester can perform 
types of experiments that the system cannot provide, such 
as varying the clock cycle for a certain period of time. This 
process is called phase stretching (see Fig. 4). Often the 
failing path can be determined at this point based on phase 
stretching experiments. Various other techniques can also
be used on the tester to isolate the failing path. Once the 
Fig. 4. Timing diagram of a phase-stretched clock. The normal
period (T) of the clock is shown in cycles 1, 2, and 4. The normal
phase time is T/2. In the second phase of cycle 3, the phase is
stretched by time Δ for a total phase time of T/2 + Δ.
failing path is isolated, the electrical failing mechanism needs
to be understood. Various tools are used to determine the 
failing mechanism. 

One method to help identify the failing mechanism is to use 
an electron-beam (E-beam) scoping tool on the chip tester. 
In this process, the failing test is run in a loop on the tester 
and internal signals are probed to look at waveforms and 
the relationships between signals. It is very similar to using 
an oscilloscope to look at a signal on a printed circuit board 
except that it is done at the chip level. 

As final confirmation of the failing mechanism, the failing 
circuit is modeled by the designer. The electrical compo- 
nents of the circuit path are extracted and simulated with a 
circuit simulator (SPICE). The modeling needs to be accu- 
rate to reproduce the failure on the simulator. Once the fail- 
ing mechanism is confirmed in SPICE, a fix is developed and 
verified. 

Since a chip turn to determine whether the fix will work and 
that the fix has no side effects takes a long time, the fix can
often be verified with a focused-ion-beam (FIB) process.
FIBing is a process by which the chip's internal connections
can be modified, thereby changing its behavior or function-
ality. In the FIB process, metal wires can be cut or joined by
metal deposition. FIB is an extremely valuable tool to verify
fixes before implementing them in the actual design. 

After the electrical failing mechanism is understood, addi-
tional work is done to create the worst-case test for this
failure. The insight gained from understanding the root cause 
allows the test to be tailored to excite the failing mechanism
more readily. This can cause the test to fail more often, at a
lower or higher frequency, a lower or higher voltage, or a
lower or higher temperature. Developing a worst-case test is
an extremely important step. The extent of the original prob-
lem cannot be understood until the worst-case test is devel-
oped. Including the worst-case test in the production screen 
ensures that parts shipped to customers will never exhibit 
the failure even under varying operating conditions and the 
most stressful hardware and software environments. 

These points can be illustrated with a case study. The nomi-
nal operating point of the PA 7200 is 120 MHz at a VDD of
4.4 volts. In this particular example, a failure occurred while
running a pseudorandom test at 5.1 volts and 120 MHz at 
high temperature (55°C ambient). Even though the PA 7200 
is not required to operate at this voltage, the verification
team did not expect this failure. Thus, this problem needed 
to be characterized and its root cause understood. 

In this example, this chip was the only one that failed at 
5.1 volts. However, a few other chips failed at even higher 
voltages. This problem was worse at higher frequencies and 
higher temperatures. The test code that was failing was con-
verted from pseudorandom system code to tester code. Next,
the test code was run on the tester and analyzed. Since this 
problem did not occur at lower frequencies, each phase of 
the clock in the test was stretched to determine which clock 
phase made the chip pass or fail. The internal state of the
chip was also dumped out on the tester using serial scan.
The failing and passing internal scanned states were com-
pared to see which states were improperly set. This helped
to isolate the failing path. Once this was done, the failing
path for this failure was analyzed to understand the electri-
cal failing mechanism. For this problem, E-beam was used
to understand the failing mechanism. 
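
The phase-stretching search used in this case study can be sketched as a simple loop: stretch one clock phase at a time and record which stretches turn a failing run into a passing one. The `run_test` callback, the phase numbering, and the stretch value are assumptions for illustration; on the real tester this step would reprogram the clock timing and rerun the extracted vectors.

```python
def find_critical_phases(run_test, num_phases, stretch_ns):
    """Return the clock phases whose stretching makes a failing test pass.

    run_test(stretch) returns True on a pass. stretch maps a phase index
    to the extra time, in nanoseconds, added to that phase only.
    """
    if run_test({}):
        return []  # the test passes unstretched, so there is nothing to localize
    return [
        phase for phase in range(num_phases)
        if run_test({phase: stretch_ns})
    ]
```

A phase that flips the result points to a speed path that is active during that phase, which narrows the search before scan dumps and E-beam probing are applied.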

Fig. 5 shows the circuit that was failing in this debugging 
example. The circuit is a latch with the signal LRXH control-
ling the transfer of data into the latch. When LRXH and CK1N
(clock) are true (logic 1), the latch is open and the inverted
level of the input RCV gets transferred to the output HM2.
When LRXH is false (logic 0), the latch is closed and the out-
put HM2 holds its state. Fig. 6 shows the waveforms of the
internal signals that were captured through E-beam. The last
two signals, CK1N and CK2N, are the two-phase clock signals
on the chip. The passing and failing waveforms for LRXH and
HM2 are shown at the top of the figure. The passing wave-
forms for LRXH and HM2 are called LRXH/4.7V@lrd-0ns and
HM2/4.7@lrd-0ns, respectively. The failing waveforms for LRXH
and HM2 are called LRXH/4.7V@lrd - 1.0ns and HM2/4.7@lrd - 1.0ns,
respectively. The input signal RCV (not shown in the figure)
is 1 during the first two pulses of LRXH shown, 0 during the
third pulse, and 1 thereafter. The output HM2 is expected to
transition from 0 to 1 during the third LRXH pulse and stay 1
until the fourth pulse. However, the slow falling edge on LRXH
causes a problem. In the failing case, on the third LRXH pulse,
HM2 transitions from 0 to 1 but the slow falling edge on LRXH
also lets the next input value of RCV (1) propagate to the
output HM2. HM2 therefore transitions back to 0. In the pass-
ing case, the falling edge of LRXH arrives a little earlier and
the output HM2 maintains what was captured in the latch (1).
Once the failing mechanism was understood, the worst-case 
test was developed. In this case study, the worst-case test 
caused many parts to fail at nominal conditions. The failing 
mechanism was modeled in a circuit model by the designer. 
Once this was done, a fix was developed. FIB was used to 
verify the fix. This failure mechanism was fixed by speeding 
up LRXH by adding a buffer to the long route of LRXH. Fig. 7 
shows how this was done. The figure is a photograph of a 
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die that was FIBed to buffer the LRXH signal. To do this, the 
long vertical metal 3 wire on the right side of the figure was 
cut with the FIB process and a buffer was inserted. A buffer 
was available on the left side of the figure; however, metal 3 
covered this buffer. The FIB process was used to elch the 
metal 3 area surrounding the buffer to expose the metal 2 
connections of t he buffer. The FIB process was then used to 
deposit metal to connect the metal 2 of the buffer to tine 
vertical metal 3 wires. The FIBed chip was then tested to 
make sure that the failing mechanism was fixed. 
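
The race described above can be illustrated with a small behavioral sketch in Python. It models only the logic stated in the text (the latch is open when LRXH and CK1N are both 1 and passes the inverted RCV level). The time steps and waveform values are illustrative assumptions, not the measured E-beam data.

```python
def simulate_latch(rcv, lrxh, ck1n, initial=0):
    """Return the HM2 waveform of a transparent, inverting latch."""
    hm2 = initial
    waveform = []
    for r, enable, clock in zip(rcv, lrxh, ck1n):
        if enable and clock:
            hm2 = 1 - r        # latch open: output follows inverted input
        waveform.append(hm2)   # latch closed: output holds its state
    return waveform

# Assumed time steps around the third LRXH pulse: RCV goes 0 -> 1 at step 4.
rcv = [0, 0, 0, 0, 1, 1, 1, 1]
ck1n = [0, 1, 1, 1, 1, 1, 0, 0]
lrxh_pass = [0, 1, 1, 1, 0, 0, 0, 0]  # falling edge arrives before RCV changes
lrxh_fail = [0, 1, 1, 1, 1, 0, 0, 0]  # slow falling edge overlaps the new RCV

print("pass:", simulate_latch(rcv, lrxh_pass, ck1n))
print("fail:", simulate_latch(rcv, lrxh_fail, ck1n))
```

In the passing case HM2 ends at 1; in the failing case the late LRXH edge lets the new RCV value through and HM2 falls back to 0, which is the behavior the buffer fix removes.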

Testability 

We leveraged the test circuits and strategy for the PA 7200
from the PA 7100 chip. The scan controller was required to
change from our proprietary diagnostic instruction port to
the industry-standard JTAG. This was a minimal change,
since both protocols do the same function. The new test
controller was leveraged from the PA 7100LC chip to keep
the design effort to a minimum. Before tape release we verified
that the basic test circuits would work.

Fig. 6. Waveforms of the internal signals of the failing latch
circuit, captured by electron-beam probing.

Since the test circuits were leveraged from the PA 7100, the
obvious choice was to leverage the test strategy from the
PA 7100 chip as well. A fast parallel pin tester was chosen
early on as the tester for the PA 7200. This tester would provide
both parallel pin testing and scan testing. We decided
that data path circuits would be tested by parallel pin testing,
and scan testing would be limited to a few control blocks.
All speed testing was to be done with parallel pin testing.

Benchtop Tester 

Since the parallel pin tester was located elsewhere, we
knew we could not use it for local debugging of the chip.
Many problems needed only simple debugging capability
and could be greatly accelerated by the presence of a local
debugging station. For that purpose, we chose an inexpensive
benchtop tester developed internally. This tester applied
all vectors serially to the chip. Vectors developed for serial
use could be used as is. The parallel pin vectors could be
translated into what we called pin vectors, which is a
boundary scan, looking-into-the-chip approach. No speed
testing capability was planned, although some support for
speed testing was present in the PA 7200.

The PA 7200 chip has on-chip clock control. This was essential
to our success because the benchtop tester was not
practically able to provide a separate clock control signal.
The tester can (and did) issue clock control commands in
the serial data. Having these commands interpreted on-chip
saved us from having to build that circuitry off-chip. This
made the chip test fixture very simple.

Fig. 7. Once the latch problem was found and a fix developed,
the fix was verified by modifying one die using a
focused-ion-beam (FIB) process. The long vertical metal 3 wire
on the right was cut with the FIB process and a buffer was
inserted. A buffer was available on the left side of the figure;
however, metal 3 covered this buffer. The FIB process was used
to etch the metal 3 area surrounding the buffer to expose the
metal 2 connections of the buffer. The FIB process was then
used to deposit metal to connect the metal 2 of the buffer to
the vertical metal 3 wires. The FIBed chip was then tested to
make sure that the failing mechanism was fixed. Photo courtesy
of FAST (FIB Applied Semiconductor Technology), San Jose,
California.

The benchtop tester was the only means of standalone chip
testing we had collocated with the design team, and therefore
was very important to the debugging efforts. The tester
used a workstation as a controller and interface, and was
capable of storing very long vectors (limited only by the
workstation's virtual memory). We had the ability to load the
entire parallel pin vector suite (590 million shifts) into the
benchtop tester at one time, although this took so long as to
be practically prohibitive. The benchtop tester had both
scan and some limited parallel pin capabilities for driving
reset pins.

Benchtop Tester Environment 

The benchtop tester was based on an HP-UX workstation 
and could be operated from a script. This allowed us to put 
our own script wrappers around the software, which pro- 
vided essential control for power supplies and the pulse 
generator. These script wrappers also provided transparent 
workarounds to some of the limitations of the tester. 

We had two testers, and we controlled access to them via HP
TaskBroker. By using HP TaskBroker, we could easily share the
test fixtures between the various uses, such as test development,
chip debugging, and automatic test verification. For
chip debugging, an engineer could obtain an interactive lock
on the tester (a window would pop up when an engineer got
the tester), and did not have to worry about interference
from an unattended job trying to run. Also, a test could be
initiated from an engineer's desk, and when a tester was
free, the test would run and return the results to the engineer.
HP TaskBroker handled all the queuing and priority
issues.

As our experience increased and our needs became clear,
we wrote more simple scripts around those we already had.
This allowed us to write complex functions as composites of
simple blocks.

Double Step 

As chip bring-up progressed, we found that we could benefit 
from some simple local speed test capabilities. As a result, 
we chose to implement basic speed testing on the benchtop 
tester stations we had in place. 

We employed programmable pulse generators and had the
software to control the frequency. All that was needed was
to convert the tests to double-step pin vectors and make
sure they worked. A double-step pin vector is the same as a
single-step pin vector, except that two chip cycles are run at
speed. This requires that the I/O cells be able to store two
values, not just one as would be required for single stepping.
This feature was already in the I/O cell design.

By converting the tests to double-step pin vectors and making
some minor changes to the design, we got double-step
pin vectors working. This capability to do at-speed local
testing was very valuable in debugging the chip.

Additional Tools 

A simple tool was put together to produce shmoo plots of
about 60 points for a single test. We spent considerable
effort optimizing this script. The engineers doing debugging
found this very valuable.
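
As a rough illustration of what such a shmoo script does, the following Python sketch sweeps supply voltage against clock frequency and prints a pass/fail grid of 60 points. The function names and the toy pass/fail model are hypothetical; the real tool would have driven the power supplies and pulse generator and run an actual test at each point.

```python
def shmoo(run_test, millivolts_list, freqs_mhz):
    """Run one test at every (voltage, frequency) point; return text rows."""
    rows = []
    for mv in millivolts_list:
        cells = "".join("P" if run_test(mv, f) else "." for f in freqs_mhz)
        rows.append(f"{mv / 1000:.1f}V {cells}")
    return rows

def toy_chip(millivolts, mhz):
    # Made-up stand-in for a real test run: the maximum passing
    # frequency rises linearly with supply voltage.
    return mhz * 50 <= millivolts + 1000

voltages_mv = [5000, 4800, 4600, 4400, 4200, 4000]  # 6 supply settings
freqs_mhz = [80 + 5 * i for i in range(10)]         # 10 frequencies, 80..125 MHz

for row in shmoo(toy_chip, voltages_mv, freqs_mhz):
    print(row)
```

Each row is one supply voltage and each column one frequency, so the boundary between the "P" and "." regions shows the operating margin at a glance.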

When doing speed path debugging, the engineer wants to
know which cycles are slow. One way to find out is to take a
failing test and make some cycles slower, and if the chip
passes, that means that the chip was failing on one of those
cycles. Just observing the pins is not enough, however, since
a failure may stay inside the chip for a while before propagating
to the pins. We implemented this kind of test by
changing our pin vector strategy from a double step to a
combination of half steps and single steps during selected
cycles of the test. Since the clock commands take a long
time to shift in, this effectively slows down some cycles of
the test. We call this style of testing phase stretching (see
Fig. 4).

Another very valuable tool was an automated phase stretching
tool. It would take a chip and find the slow cycles with a
given set of tests. This would take a few hours, but need not
be supervised, so overnight tests worked well. While this
would not tell what the problem was directly, it provided
valuable clues.
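
The search for slow cycles can be sketched as follows. This is one possible strategy, written in Python under stated assumptions, and is not a description of the actual tool: `passes_when_stretched` is a hypothetical callback that runs a test with the given set of cycles stretched and reports pass or fail, and `make_toy_chip` is a stand-in for real hardware.

```python
def find_slow_cycles(num_cycles, passes_when_stretched):
    """Return the cycles that must be stretched for the test to pass.

    Returns None if the test fails even with every cycle stretched,
    which suggests the failure is not speed related.
    """
    all_cycles = set(range(num_cycles))
    if not passes_when_stretched(all_cycles):
        return None
    slow = []
    for cycle in range(num_cycles):
        # Run this one cycle at speed and stretch all the others.
        if not passes_when_stretched(all_cycles - {cycle}):
            slow.append(cycle)
    return slow

def make_toy_chip(slow_cycles):
    # Toy model: the test passes only if every slow cycle is stretched.
    def passes_when_stretched(stretched):
        return slow_cycles <= stretched
    return passes_when_stretched

print(find_slow_cycles(12, make_toy_chip({3, 7})))
```

Running one cycle at speed while stretching all others isolates each slow cycle independently, so the search still works when a test contains more than one slow cycle. It needs one test run per cycle, which fits the unattended overnight use described above.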

We also had the ability to run part of a test, then stop it and
dump the state of the internal scan chains. A chip expert
could look at these dumps and see what went wrong. This
tool was extremely useful during our debugging efforts.

The benchtop testers were considered very valuable to the
debugging of the PA 7200. The software written for the
testers contributed greatly to their success. The benchtop
testers became known for their reliability, ease of use, and
locality.

Design-Process Interactions 

To achieve the highest quality in any VLSI product, it is very
important to ensure that there is good harmony in the relationship
between the chip design and the chip fabrication
process. This relationship on the PA 7200 went through
some rocky roads and had its own interesting set of problems.
In the end, however, the desired harmony was achieved
and is reflected in the high quality and yields of the final
product. This section will describe the situation that existed
on the PA 7200 and some of the steps taken to anticipate
and smooth out problems in this area.

The characteristics of the IC process have a big influence on
decisions made at every step of the development cycle of a
VLSI product, starting from the early stages of the design.
The influence can be seen in many areas like the goals of the
design, the feature set to be included, and the details of the
implementation at the transistor level. For example, the process
dictates the intrinsic speed of the transistor, which is a
key factor in setting the frequency goals of the chip.
Similarly, the minimum feature sizes (line width, spacing,
etc.) of the process largely dictate the size of the basic storage
or memory cell. This in turn is a factor that determines,
for example, the size of the TLB on the die, which is a key
component in determining the performance of a microprocessor.
An example of this influence at the implementation
level would be an input pad receiver designed to trip at
a particular voltage level on the external (input) signal. The
implementation has to ensure that the trip level is fairly
tightly controlled at all corners of the process, which is not
easy to do. Another trivial example is the size of a power or
ground trace. The size of the trace required to carry a certain
amount of current is largely dictated by the resistance
and electromigration limits of the metal.

There were two target HP IC processes in mind when the
design of the PA 7200 began: CMOS26 and CMOS14.
CMOS26 was the process of the previous generation CPUs,
the PA 7100 and the PA 7100LC. Its benefits were that it was
a very mature and stable process. Also, some circuits of the
PA 7100 are used in the PA 7200 with little or no modification,
and the behavior of these circuits was well-understood
in this process. CMOS14 was the next-generation process
being developed. Its benefits were, obviously, smaller feature
size and better FET speed. However, only a few simple
chips had been fabricated in this process before the PA 7200,
and many startup problems were likely to be encountered.
That we had a choice influenced the design methodology.
Taking advantage of the scalability of CMOS designs, the
initial design was done in CMOS26. An artwork shrink process
was developed to convert the design to CMOS14. The
shrink process is a topic that merits special attention and is
described in the article on page 25.

As the design went along, it became clear that to meet the
performance and size goals of the product, CMOS14 was the
better choice. To demonstrate feasibility and to iron out
problems with the shrink process, the existing PA 7100 CPU
was taken through the shrink process and fabricated in
CMOS14. Several issues were uncovered, leading to early
detection of potential problems.

Related to the IC fabrication process, the goal of electrical
verification and characterization is to ensure that the VLSI
chip operates correctly for parts fabricated within the
bounds of the normal process variations expected during
manufacturing. An incomplete job done here or variations of
the process outside the normal range can cause subtle problems
that often get detected much later on. There are two
yield calculations that are often used to quantify the
manufacturability of a VLSI product. The functional yield denotes
the fraction of the total die manufactured that are fully
operational (or functional) at some electrical operating point,
that is, some combination of frequency, voltage, and temperature.
The survival yield denotes the fraction of the functional
die that are operational over the entire electrical operating
range of the product, that is, the product specifications
for frequency, voltage, and temperature. (In reality, to guarantee
this, there is some guardbanding that occurs beyond
the operating range of the product.)
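
The two yield definitions can be written out as a short Python sketch. The die counts below are invented purely to show the arithmetic and are not PA 7200 data.

```python
def functional_yield(total_die, functional_die):
    """Fraction of all manufactured die that work at some operating point."""
    return functional_die / total_die

def survival_yield(functional_die, full_range_die):
    """Fraction of functional die that work over the whole specified range."""
    return full_range_die / functional_die

# Hypothetical lot: 200 die, 120 functional, 90 good over the full range.
fy = functional_yield(200, 120)
sy = survival_yield(120, 90)
print(f"functional yield = {fy:.2f}")
print(f"survival yield   = {sy:.2f}")
print(f"shippable fraction of all die = {fy * sy:.2f}")
```

Because the survival yield is measured against functional die rather than all die, the product of the two yields gives the fraction of all manufactured die that meet the full specification.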

To achieve the highest quality and manufacturability of the
final product, the following are some of the objectives set
for electrical characterization:

• Ensure that the design has solid operating margin (in voltage,
frequency, and temperature) for parts fabricated at all
the different corners of the process.

• Ensure consistently high survival yield for a statistically
large number of wafers and lots fabricated.

• To ferret out problems that may be otherwise hard to find,
fabricate some parts at points beyond the normal variations
of the process. Debug problems in these parts to ensure the
robustness of the design.

The PA 7200 chip was the first complex VLSI chip to be fabricated
in CMOS14. That the process was not fully mature at
that point had important implications on the electrical
characterization and debugging effort. Special care had to be
taken to distinguish between the different types of problems:
design problems, process problems, and design-process
interaction problems.

A fundamental design problem is one that shows up on every
lot (a batch of wafers processed together in the fabrication
shop), whatever the process parameters for that lot might
be. For example, a really slow path on the chip may have
some frequency variation from lot to lot, but will show up on
every lot.

Process problems show up on some lots. The most common
symptom is poor functional yield. Sometimes, however, the 
symptom can be a subtle electrical failure that is hard to 
debug. For example, one problem manifested itself as a race 
between the falling edge of a latching signal and new data to 
the latch. SPICE simulations showed that the failure could 
occur only under abnormally unbalanced loads and clock 
skews, which were unrealistic. 

A design-process interaction problem shows up to varying
degrees on different lots. It points to a design that is not
very robust and is treated very much like a design problem.
However, typically there tends to be some set of process
conditions that aggravate the problem. Tighter process
controls or retargeting the process slightly can reduce the
impact of such problems temporarily, but the long-term
solution is always to fix the design to make it more robust.
For example, some coupling issues on the PA 7200 occurred
only at one corner of the process. By retargeting the process
to eliminate that corner, the survival yield was significantly
increased.

The shrink process mentioned earlier had given us tremendous
benefits in terms of flexibility and the ability to leverage
existing circuits. However, effort was also spent in identifying
circuits that did not shrink very well. These circuits were
given special care and modified when the decision to use
CMOS14 was made. Overall, the shrink effort was very successful
largely because of the scalability of CMOS designs.
However, the characterization and debugging phase exposed
some interesting new limitations on the scalability of CMOS
designs. When a shrink-related problem circuit was found,
the chip was scanned for other circuits that could have a
similar problem. These circuits were then fixed to prevent
future problems.

Throughout the project, the team always tracked down
problems to their root causes. This approach guaranteed
complete fixes for problems and kept them from ever showing
up again. The result is high-quality bug-free parts and
high yields in manufacturing.

In addition to finding and fixing problems in the design, 
there was also a related activity that happened in parallel. 
Process parametric data was analyzed in detail for every lot 
to look for an optimum region in the process. Detailed cor- 
relation data was produced between process parameters 
and chip characteristics like speed, failing voltages, types of 
failures, and so on. Many different experiments with process 
parameters and masks were also tried, including polysilicon 
biases, metal thicknesses, and others. This enabled us to 
fine-tune the process to increase the margins, yields, and 
quality of the product. 

Conclusion 

With the increasing complexity of VLSI chips, specifically 
CPUs, design verification has become a critical and chal- 
lenging task. This paper has described the methodology and 
techniques used to verify the PA 7200 CPU. The approaches 
used yielded very good results and led to the efficient detec- 
tion and isolation of problems on the chip. This has enabled 
Hewlett-Packard to achieve high-quality, volume shipments 
in a timely manner. 

A New Memory System Design for
Commercial and Technical Computing
Products



This new design is targeted for use in a wide range of HP commercial
servers and technical workstations. It offers improved customer
application performance through improvements in capacity, bandwidth,
latency, performance scalability, reliability, and availability. Two keys to
the improved performance are system-level parallelism and memory
interleaving.

by Thomas R. Hotchkiss, Norman D. Marschke, and Richard M. McClosky 



Initially used in HP 9000 K-class midrange commercial servers
and J-class high-end technical workstations, the J/K-class
memory system is a new design targeted for use in a
wide range of HP's commercial and technical computing
products, and is expected to migrate to lower-cost systems
over time. At the inception of the memory design project,
there were two major objectives or themes that needed to
be addressed. First, we focused on providing maximum
value to our customers, and second, we needed to maximize
HP's return on the development investment.

The primary customer value proposition of the J/K-class
memory system is to maximize application performance
over a range of important cost points. After intensive studies
of our existing computing platforms, we determined that
memory capacity, memory bandwidth, memory latency, and
system-level parallelism were key parameters for improving
customer application performance. A major leap in memory
bandwidth was achieved through system-level parallelism
and memory interleaving, which were designed into the
Runway bus and the memory subsystem. A system block
diagram of an HP 9000 K-class server is shown in Fig. 1 on
page 9. The Runway bus (see article, page 18) is the
"information superhighway" that connects the CPUs, memory, and
I/O systems. System-level parallelism and memory interleaving
mean that multiple independent memory accesses can
be issued and processed simultaneously. This means that a
CPU's access to memory is not delayed while an I/O device
is using memory. In a Runway-based system with the J/K-class
memory system, multiple CPUs and I/O devices can all
be accessing memory in parallel. In contrast, many of HP's
earlier computing platforms can process only one memory
transaction at a time.

Another important customer value proposition is investment
protection through performance scalability. Performance
scalability is offered in two dimensions: symmetric multiprocessing
and processor technology upgrades to the forthcoming
PA 8000 CPU. The J/K-class memory system provides the
memory capacity and bandwidth needed for effective performance
scaling in four-way multiprocessing systems.
Initially, Runway-based systems will be offered with the
PA 7200 CPU (see article, page 25), and will be upgradable
to PA 8000 CPU technology with a simple CPU module exchange.
The J/K-class memory system will meet the demanding
performance requirements of the PA 8000 CPU.

Performance is only one part of overall system value. Another
major component of system value is cost. For example,
the use of commodity DRAM technology was imperative
because competitive memory pricing is an absolute requirement
in the cost-sensitive workstation marketplace. The
J/K-class memory system provides lasting performance with
commodity memory pricing and industry-leading price/performance.
Low cost was achieved by using mature IC processes,
commodity DRAM technology, and low-cost chip
packaging. A closely coupled system design approach was
combined with a structured-custom chip design methodology
that allowed the design teams to focus custom design
efforts in the areas that provided the highest performance
gains without driving up system cost. For example, the system
PC boards, DRAM memory modules, and custom chip
I/O circuits were designed and optimized together as a
single highly tuned system to achieve aggressive memory
timing with mature, low-cost IC process and chip packaging
technologies.

A further customer value important in HP computing products
is reliability and availability. The J/K-class memory system
delivers high reliability and availability with HP proprietary
error detection and correction algorithms. Single-bit
error detection and correction and double-bit error detection
are implemented, of course; these are fairly standard
features in modern, high-performance computer systems.
The J/K-class memory system provides additional availability
by detecting single DRAM part failures for x4 and x8
DRAM topologies, and by detecting addressing errors. The
DRAM part failure detection is particularly important because
single-part failures are more common than double-bit
errors. Extensive circuit simulation and margin and reliability
testing ensure a high-quality electrical design that minimizes
the occurrence of errors.

Finally, greater system reliability is achieved through complete
memory testing at system boot time. Given the large
maximum memory capacity, memory test time was a major
concern. When full memory testing takes a long time, customers
may be inclined to disable complete memory testing
to speed up the boot process. By using custom firmware test
routines that capitalize on system-level parallelism and the
high bandwidth capabilities of the memory system, a full 2G
bytes of memory* can be tested in less than five minutes.

Return on Investment 

Large-scale design projects like the J/K-class memory system
typically have long development cycles and require
large R&D investments. To maximize the business return on
large projects, designs need to provide lasting value and
cover a wide range of products. Return on investment can
be increased by improving productivity and reducing time to
market. Leveraging and outsourcing should be used as appropriate
to keep HP engineers focused on those portions of
the design that provide maximum business value. The J/K-class
memory system project achieved all of these important
objectives.

A modular architecture was designed so that different
memory subsystem implementations can be constructed
using a set of building blocks with simple, DRAM-technology-independent
interfaces. This flexible architecture allows the
memory system to be used in a wide range of products, each
with different price and performance points. Given the long
development cycles associated with large VLSI design projects,
changing market conditions often require VLSI chips to
be used in products that weren't specified during the design
cycle. The flexible, modular architecture of the J/K-class
memory system increases the design's ability to succeed in
meeting unforeseen market requirements. Simple interfaces
between modules allow components of the design to be leveraged
easily into other memory design projects. Thus, return
on investment is maximized through flexibility and leverage
potential.

* HP Journal memory size convention:
  1 kbyte = 1,000 bytes            1K bytes = 1,024 bytes
  1 Mbyte = 1,000,000 bytes        1M bytes = 1,048,576 bytes
  1 Gbyte = 1,000,000,000 bytes    1G bytes = 1,073,741,824 bytes

Reducing complexity is one of the most powerful techniques
for improving time to market, especially with geographically
diverse design teams and concurrent engineering. A single
critical bug discovered late in the development cycle can
easily add a month or more to the schedule. In the J/K-class
memory project, design complexity was significantly reduced
by focusing on the business value of proposed features. The
basic philosophy employed was to include features that provide
80% to 90% of the customer benefits for 20% of the effort;
this is the same philosophy that drove the development of
RISC computer architectures. The dedication to reduced
complexity coupled with a strong commitment to design
verification produced excellent results. After the initial design
release, only a single functional bug was fixed in three
unique chips.

Several methods were employed to increase productivity
without sacrificing performance. First, the memory system
architecture uses a "double-wide, half-speed" approach. Most
of the memory system runs at half the frequency of the
high-speed Runway bus, but the data paths are twice as wide so
that full Runway bandwidth is provided. This approach,
coupled with a structured-custom chip design methodology,
made it possible to use highly automated design processes
and a mature IC process. Custom design techniques were
limited to targeted portions of the design that provided the
greatest benefits. Existing low-cost packaging technologies
were used and significant portions of the design were outsourced
to third-party partners. Using all these techniques,
high performance, low cost, and high productivity were
achieved in the J/K-class memory system design project.

Wide Range of Implementations 

The memory system design for the J/K-class family of computers
covers a wide range of memory sizes from 32M bytes
in the entry-level workstation (Fig. 1) to 2G bytes in the fully
configured server (Fig. 2). Expandability is achieved with
plug-in dual-inline memory modules that each carry 36 4M-bit
or 16M-bit DRAMs. (Note: Even though the memory modules
are dual-inline and can be called DIMMs, we usually refer to
them as SIMMs, or single-inline memory modules, because
this terminology is so commonly used and familiar.) Because
each DRAM data bus in the memory system is 16 bytes wide,
the 8-byte-wide SIMMs are always installed in pairs. Using
the 4M-bit DRAMs, each pair of SIMMs provides 32M bytes
of memory, and with 16M-bit DRAMs, a pair of SIMMs provides
128M bytes of memory.

In an entry-level workstation, the memory can start at 32M
bytes (one pair of SIMMs with 4M-bit DRAMs) and be expanded
up to 512M bytes (using 4 pairs of SIMMs with
16M-bit DRAMs) as shown in Fig. 1. The HP J-class
workstation can be expanded to 1G bytes of memory using
eight pairs of SIMMs with 16M-bit DRAMs. The HP 9000
Model K400 server can be expanded up to 2G bytes using 16
pairs of SIMMs with 16M-bit DRAMs installed in two memory
carriers.

Fig. 2. High-performance HP 9000 Model K400 memory system.
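
The capacities quoted above can be checked with a few lines of Python. The sketch assumes that one ninth of the DRAM bits on each SIMM hold error correction code, which follows from the 16 ECC bits per 128 data bits described for the memory data bus elsewhere in this article; the function name is illustrative only.

```python
DRAMS_PER_SIMM = 36
DATA_BITS, ECC_BITS = 128, 16  # per 16-byte memory word

def simm_pair_mbytes(dram_mbits):
    """Usable data capacity, in M bytes, of one pair of SIMMs."""
    raw_mbits = 2 * DRAMS_PER_SIMM * dram_mbits
    data_mbits = raw_mbits * DATA_BITS // (DATA_BITS + ECC_BITS)
    return data_mbits // 8

print(simm_pair_mbytes(4))        # one pair with 4M-bit DRAMs
print(simm_pair_mbytes(16))       # one pair with 16M-bit DRAMs
print(4 * simm_pair_mbytes(16))   # 4 pairs, entry-level maximum
print(8 * simm_pair_mbytes(16))   # 8 pairs
print(16 * simm_pair_mbytes(16))  # 16 pairs, fully configured server
```

The results, 32, 128, 512, 1024, and 2048M bytes, match the 32M-byte minimum, the 128M-byte pair, and the 512M-byte, 1G-byte, and 2G-byte configurations given in the text.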

Design Features 

The J/K-class memory system design consists of a set of 
components or building blocks that can be used to construct 
a variety of high-performance memory systems. A primary 
goal of the memory system is to enable system designers to 
build high-bandwidth, low-latency, low-cost memory systems. 
Major features of the design are:

• High performance
• 36-bit real address (32G bytes)
• Support for 4M-bit, 16M-bit, and 64M-bit DRAM technology
• Proven electrical design up to 2G bytes of memory with
16M-bit DRAMs (up to 8G bytes with 64M-bit DRAMs when
available)
• Logical design supports up to 8G bytes with 16M-bit DRAMs
and up to 32G bytes with 64M-bit DRAMs
• Minimum memory increment of 32M bytes with 1Mx4-bit
(4M-bit) DRAMs (2 banks of memory on 2 SIMMs)
• 32-byte cache lines
• Memory interleaving: 4-way per slave memory controller,
electrically proven up to 32-way with logical maximum of
128-way interleaving
• Single-bit error correction, double-bit error detection, and
single-DRAM device failure detection for x4 and x8 parts
• Address error detection
• Error detection, containment, and reporting when corrupt
data is received
• Memory test and initialization in less than 5 minutes for 2G
bytes of memory
• Soft-error memory scrubbing and page deallocation implemented
with software
• 16-byte and 32-byte write access to memory
• IEEE 1149.1 boundary scan in all VLSI parts.

Memory System Description

A block diagram for a high-performance HP 9000 Model 
K400 memory system is shown in Fig. 2. The memory sys- 
tem has four major components: the master memory con- 
troller (MMC), multiple slave memory controllers (SMC), a 
data accumulator/multiplexer (DM), and plug-in memory 
modules (SIMMs). The memory system design allows many 
possible configurations, and the high-performance Model 
K400 system is an example of a large configuration. 

The basic unit of memory is called a bank. Each bank of 
memory is 16 data bytes wide and can be addressed inde- 
pendently of all other banks. A 32-byte processor cache line 
is read or written in two segments using fast-page-mode
accesses to a bank. Two 16-byte segments are transferred to
or from a bank to make up one 32-byte cache line.
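
To make the bank and cache line relationship concrete, here is a small Python sketch of one possible interleaving scheme. The mapping of consecutive cache lines to successive banks is an assumption chosen for illustration; this article does not specify how the memory controllers actually map addresses to banks.

```python
CACHE_LINE_BYTES = 32
SEGMENT_BYTES = 16  # one bank access transfers 16 data bytes

def bank_for_address(addr, num_banks):
    """Assumed mapping: consecutive cache lines rotate through the banks."""
    return (addr // CACHE_LINE_BYTES) % num_banks

def segments_for_line(addr):
    """Start addresses of the two 16-byte segments of the line holding addr."""
    base = addr - addr % CACHE_LINE_BYTES
    return [base, base + SEGMENT_BYTES]

# Four sequential cache lines land in four different banks, so their
# accesses can overlap instead of queuing behind one another.
for line_addr in range(0, 4 * CACHE_LINE_BYTES, CACHE_LINE_BYTES):
    print(line_addr, bank_for_address(line_addr, num_banks=4))
```

Under this assumed mapping, both segments of a cache line come from the same bank, which is consistent with the fast-page-mode access described above.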

Each slave memory controller (SMC) supports up to four 
independent memory banks. Memory performance is highly 
dependent on the number of banks, so the SIMMs are de- 
signed so that each SIMM contains eight bytes of two banks. 
Since a bank is 16 bytes wide, the minimum memory increment
is two SIMMs, which yields two complete banks. An
additional 16 bits of error correction code (ECC) is included
for each 16 bytes (128 bits) of data. Thus, a memory data
bus carrying 16 bytes of data requires 144 bits total (128 data
bits + 16 ECC bits).

The 16-byte-wide memory data bus, which connects the
master memory controller (MMC) to the data multiplexer
(DM) chip set, operates at 60 MHz for a peak bandwidth of
960 Mbytes/s. Memory banks on the SIMM sets are connected
to the DM chip set via 16-byte RAM data buses (RD_A
and RD_B), which operate at 30 MHz, yielding a peak bandwidth
of 480 Mbytes/s. However, these data buses are independent,
so if memory access operations map to alternate
buses, the peak bandwidth available from RD_A and RD_B
equals that of the memory data bus. The actual bandwidth
will depend on the memory access pattern, which depends
on the system workload.
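
The peak bandwidth figures follow directly from bus width and clock rate, as this short Python check shows. A 16-byte bus delivering 960 Mbytes/s implies a 60-MHz clock, twice the 30-MHz rate of each RAM data bus.

```python
def peak_mbytes_per_s(bus_bytes, clock_mhz):
    """Peak bandwidth of a bus that moves bus_bytes on every clock."""
    return bus_bytes * clock_mhz

memory_data_bus = peak_mbytes_per_s(16, 60)  # MMC to DM chip set
ram_data_bus = peak_mbytes_per_s(16, 30)     # each of RD_A and RD_B

print(memory_data_bus)   # 960
print(ram_data_bus)      # 480
print(2 * ram_data_bus)  # 960 when accesses alternate between RD_A and RD_B
```

The two half-speed RAM data buses together match the memory data bus only when accesses alternate between them, which is why the delivered bandwidth depends on the access pattern.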

The set of signals connecting the MMC to the SMC chips and
DM chips is collectively known as the memory system interconnect
(MSI) bus. It is shown in Fig. 2 as the memory address
bus and the MUX data bus.

A 32-byte single cache line transfer requires two cycles of 
data on the RAM data and MSI buses. Since the RAM data 
bus operates at one-half the frequency of the MSI bus, the 
data multiplexer chips are responsible for accumulating and 
distributing data between the MUX data bus and the slower
RAM data buses. To reduce the cost of the data MUX chips, 
the design of the chip set is bit-sliced into four identical 
chips. Each DM chip handles 36 bits of data and is packaged 
in a low-cost plastic quad flat pack. 

Two sets of DM chips are shown in Fig. 2, and four SMC
chips are associated with each DM set. Logically, the MSI
protocol supports up to 32 total SMC chips, up to 32 SMC
chips per DM set, and any number of DM sets. Presently,
memory systems having up to eight SMC chips and two DM 
sets have been implemented. 

VLSI Chips 

Master Memory Controller. Each memory system contains a 
single master memory controller. The MMC chip is the core
of the memory system. It communicates with the processors 
and the I/O subsystem over the Runway bus, and generates 
basic memory accesses which are sent to the slave memory 
controllers via the MSI bus. 

Slave Memory Controller. A memory system contains one or 
more SMC chips, which are responsible for controlling the 
DRAM banks based on the accesses generated by the MMC. 
The partitioning of functionality between the MMC and SMC 
has been carefully designed to allow future support of new 
types of DRAMs without modification to the MMC. The 
motivation for this is that the MMC is a large complex chip 
and would require a large engineering effort to redesign or 
modify. The SMC and DM chips are much simpler and can 
be redesigned with less effort on a faster schedule. The 
memory system design is partitioned so that the MMC does
not contain any logic or functionality that is specific to a
particular type of DRAM. The following logic and functionality
are DRAM-specific and are therefore included in the SMCs:

• DRAM timing

• Refresh control

• Interleaving

• Memory and SIMM configuration registers

• DM control.

Each SMC controls up to four banks of memory. 

Operation of the DRAMs is controlled by multiple slave
memory controllers, which receive memory access commands
from the system bus through the master memory
controller. Commands and addresses are received by all
SMCs. A particular SMC responds only if it controls the requested
address and subsequently drives the appropriate
DRAMs with the usual row address strobe (RAS) and column
address strobe (CAS).

The slave memory controller chips have configuration regis- 
ters to support the following functions: 



• Interleaving 

• Bank-to-bank switching rates 

• Programmable refresh period 

• SIMM sizes 

• Programmable DRAM timing

• SMC hardware version number (read only) 

• SMC status registers for latching MSI parity errors. 

Memory refresh is performed by all of the SMCs in a stag- 
gered order so that refresh operations are nonsimultaneous. 
Staggered refresh is used to limit the step load demands on
the power supply to minimize the supply noise that would
be caused by simultaneously refreshing all DRAMs (up to
1152 in a Model K400 system). This lowers overall system
cost by reducing the design requirement for the power 
supply. 

Data Multiplexer Chip Set. The DM chips are responsible for 
accumulating, multiplexing, and demultiplexing between the 
16-byte memory data bus and the two independent 16-byte 
RAM data buses. They are used only in high-performance 
memory systems with more than eight banks of memory. 

Dual-Inline Memory Modules 

The dual-inline memory modules (called SIMMs) used in this 
design are 72 bits wide (64 data bits + 8 ECC bits) organized 
into two half-banks as shown in Fig. 3. With 72 bits of shared 
data lines and two independent sets of address and control 
lines, they hold twice as much memory and supply twice as 
many data bits as the common single-inline memory modules 
used in personal computers. Two SIMMs are used to form 
two fully independent 16-byte banks of memory. Each SIMM 
holds 36 4-bit DRAMs, 18 DRAMs for each half-bank. Using
36 1Mx4-bit DRAMs, each SIMM provides 16M bytes of
memory. With 4Mx4-bit DRAMs, each SIMM holds 64M bytes. 
To connect all of the data, address, and control signals, a 
144-pin dual-inline socket is used (not compatible with the 
72-pin SIMMs used in PCs). 
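
The SIMM capacities quoted above can be checked with a little arithmetic. The sketch below is illustrative (the function name and parameters are not from the design); it assumes that 8 of every 72 bits on the module are ECC, as stated in the text, so only 64/72 of the raw DRAM capacity holds data.

```python
def simm_capacity_mbytes(dram_depth_m: int,
                         dram_width_bits: int = 4,
                         drams_per_simm: int = 36,
                         data_bits: int = 64,
                         ecc_bits: int = 8) -> int:
    """Usable data capacity of one SIMM in Mbytes, excluding ECC bits."""
    raw_mbits = drams_per_simm * dram_depth_m * dram_width_bits
    data_mbits = raw_mbits * data_bits // (data_bits + ecc_bits)
    return data_mbits // 8


# Each half-bank is 18 DRAMs x 4 bits = 72 bits wide (64 data + 8 ECC).
half_bank_width_bits = 18 * 4

capacity_1m = simm_capacity_mbytes(1)   # SIMM built from 1Mx4-bit DRAMs
capacity_4m = simm_capacity_mbytes(4)   # SIMM built from 4Mx4-bit DRAMs

print(half_bank_width_bits, capacity_1m, capacity_4m)
```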

The motivation for designing our own memory module 
rather than using an industry-standard SIMM was memory 
capacity and performance. We would have needed twice as 
many PC SIMMs for an equivalent amount of memory. This
would create a physical space problem because twice as
many connectors would be needed, and would create per- 
formance problems because of the increased printed circuit 
board trace lengths and unterminated transmission line 
stubs. 

However, our custom SIMM is expected to become "industry
available." There are no custom VLSI chips included on our
SIMMs, so third-party suppliers can clone the design and
offer memory to HP customers. This is very important because
having multiple suppliers selling memory to our customers
ensures a market-driven price for memory rather
than proprietary pricing. HP customers now demand competitive
memory pricing, so this was a "must" requirement
for the design.

Banks and Interleaving 

Each SMC has four independent bank controllers that perform
a double read or write operation to match the 16-byte
width of the RAM data path to the 32-byte size of the CPU
cache line. Thus each memory access operation to a particular
bank is a RAS-CAS-CAS* sequence for reading two successive
memory locations, using the fast-page-mode capability
of the DRAMs. A similar sequence to another bank can overlap
in time so that the -CAS-CAS portion of the second bank
can follow immediately after the RAS-CAS-CAS of the initial
bank. This is interleaving.

N-way interleaving is implemented, where N is a power of 2. 
The total number of banks in an interleave group is not necessarily
a power of 2. When the number of banks is a power
of 2, then the bank select for a given physical address is determined
by the N low-order bits of the physical address. All
banks within an interleave group must be the same size.
Memory banks from different-size SIMMs can be installed in 
the same memory subsystem, but they must be included in 
different interleave groups. 
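
As a rough illustration of the power-of-2 case, the sketch below selects a bank from the low-order bits of the cache-line number. It assumes the select field sits just above the 32-byte line offset, so that consecutive cache lines land in consecutive banks; the article does not give the exact bit positions, and the function name is hypothetical. The non-power-of-2 algorithm is not described in enough detail to sketch here.

```python
LINE_BYTES = 32  # processor cache line size


def bank_select(phys_addr: int, num_banks: int) -> int:
    """Pick a bank from the low-order line-number bits (power-of-2 banks only)."""
    if num_banks <= 0 or num_banks & (num_banks - 1):
        raise ValueError("this sketch only covers power-of-2 bank counts")
    line_number = phys_addr // LINE_BYTES   # drop the byte offset within the line
    return line_number & (num_banks - 1)    # keep log2(num_banks) low-order bits


# Four consecutive cache lines spread across four banks, then wrap around.
banks = [bank_select(addr, 4) for addr in range(0, 5 * LINE_BYTES, LINE_BYTES)]
print(banks)
```

Spreading sequential lines across banks in this way is what lets the RAS-CAS-CAS sequences of back-to-back accesses overlap.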

When the number of banks installed is not a power of 2, the
interleaving algorithm is specially designed to provide a uniform,
nearest-power-of-2 interleaving across the entire address
range. For example, if you install six banks of the same
size, you will get 4-way interleaving across all six banks
rather than 4-way interleaving across 4/6ths of the memory
and 2-way interleaving across 2/6ths of the memory. This
special feature prevents erratic behavior when non-power-of-2
numbers of banks are installed.

* RAS = Row address strobe
  CAS = Column address strobe

Soft Errors and Memory Scrubbing

DRAM devices are known to lose data periodically at single-bit
locations because of alpha particle hits. The rate of occurrence
of these soft errors is expected to be one every
1 million hours of operation per device. A fully configured 2G-byte
memory system uses 1152 (32x36) DRAM devices. Thus, a
soft single-bit error is expected to occur once every 868
hours or ten times per year in normal operation. Single-bit
errors are easily corrected by the MMC when the data is
read from the memory using the ECC bits. Single-bit errors
are corrected on the fly with no performance degradation.
At memory locations that are seldom accessed, the occurrence
of an uncorrectable double-bit error is a real threat to
proper system operation. To mitigate this potential problem,
memory-read operations are periodically performed on all
memory locations to find and correct any single-bit errors
before they become double-bit errors. This memory scrubbing
software operation occurs in the background with
virtually no impact on system performance.
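
The soft error estimate can be reproduced as follows. This is a back-of-the-envelope sketch that assumes errors in different DRAM devices are independent, so the per-device rates simply add; the names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def system_error_interval_hours(device_interval_hours: float,
                                num_devices: int) -> float:
    """Mean time between soft errors for the whole memory system."""
    return device_interval_hours / num_devices


devices = 32 * 36                                     # fully configured system
interval = system_error_interval_hours(1_000_000, devices)
errors_per_year = HOURS_PER_YEAR / interval

print(devices, round(interval), round(errors_per_year))
```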

When a particular DRAM device has a propensity
for soft errors or develops a hard (uncorrectable) error,
that area of memory is deemed unusable. The memory is
segmented into 64K-byte pages. When any location within a 
particular page is deemed unusable, then that entire page is 
deallocated from the inventory of available memory and not 
used. Should the number of deallocated pages become 
excessive, the respective SIMM modules are deemed faulty 
and must be replaced at a convenient service time. 
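
The page deallocation policy can be pictured with a small bookkeeping sketch. Everything here is hypothetical except the 64K-byte page size: the class and method names are invented for illustration, and the threshold at which the count of bad pages becomes "excessive" is not given in the article, so it is left as a parameter.

```python
PAGE_BYTES = 64 * 1024  # deallocation granularity stated in the text


class PageInventory:
    """Tracks which 64K-byte pages have been removed from use."""

    def __init__(self, total_bytes: int, max_bad_pages: int):
        self.num_pages = total_bytes // PAGE_BYTES
        self.max_bad_pages = max_bad_pages   # hypothetical service threshold
        self.bad_pages = set()

    def report_bad_address(self, phys_addr: int) -> int:
        """Deallocate the whole page containing a faulty location."""
        page = phys_addr // PAGE_BYTES
        if not 0 <= page < self.num_pages:
            raise ValueError("address outside installed memory")
        self.bad_pages.add(page)
        return page

    def is_usable(self, phys_addr: int) -> bool:
        return phys_addr // PAGE_BYTES not in self.bad_pages

    def needs_service(self) -> bool:
        """True once the number of deallocated pages exceeds the threshold."""
        return len(self.bad_pages) > self.max_bad_pages


inventory = PageInventory(total_bytes=32 * 1024 * 1024, max_bad_pages=1)
bad_page = inventory.report_bad_address(0x12345)   # falls in page 1
print(bad_page, inventory.is_usable(0x10000), inventory.is_usable(0x20000))
```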

Memory Carrier Board 

The memory carrier board (Fig. 4) is designed to function 
with HP 9000 K-class and J-class computer systems. The 
larger K-class system can use up to two memory carrier 
boards, while the smaller J-class system contains only one 
board, which is built into the system board. Each memory 
carrier board controls from two to sixteen SIMMs. This 
allows each memory carrier board to contain from 32M 
bytes to 1G bytes of memory. 

There are four data multiplexer chips on the memory carrier 
board. These multiplex the two 144-bit RAM data buses 
(RD_A and RD_B) to the single 144-bit MSI data path to the 
MMC chip. They also provide data path timing. Four SMC
components on the memory carrier board provide the MSI
interface control, DRAM control and timing, data MUX
control, refresh timing and control, and bank mapping. 

The memory carrier board is designed with maximal inter- 
leaved memory access in mind. Each SMC controls four 
SIMM pairs (actually only four banks because there are two
banks on each SIMM pair) and one data bus set (two of four
72-bit parallel RAM data buses). Each data MUX controls
36 bits of each RAM data bus and 36 bits of the MSI bus.
Each SIMM has two address buses (one for each bank) and
one 72-bit RAM data bus. For example, SMC 0 controls
SIMM pairs 0a/b, 5a/b, 6a/b, and 3a/b, using the RD_A0[0:71]
bus for the SIMMs in the "a" connectors and the RD_A1[0:71]
bus for the SIMMs in the "b" connectors.

Memory Read or Write Operations

The memory carrier board operates on basically one kind of 
memory transaction: a 32-byte read or write to a single bank
address. This board will also handle a 16-byte read or write
operation; the timing and addressing are handled just like a
32-byte operation except that the CAS signal for the unused
16-byte half is not asserted.



To perform a memory read or write operation the following
sequence of events occurs. First, an address cycle on the
Runway bus requests a memory transaction from the memory
subsystem. This part of the operation is taken care of by the
MMC. The MMC chip then places this address onto the MSI
bus along with the transaction type code (read or write). All
the SMCs and the MMC latch this address and transaction
code and place this information into their queues. Each SMC
and the MMC chip must keep an identical set of transaction
requests in their queues; this is how data is synchronized for
each memory operation (the SMCs operate in parallel and
independently on separate SIMMs).

Once the address is loaded into their queues, the SMCs 
check to see if the address matches one of the banks that
they control. The SMC that has a match will then start an 
access to the matching bank if that bank is not already in 
use. The other SMCs could also start other memory ac- 
cesses in parallel provided there are no conflicts with banks 
currently in use. 

The memory access starts with driving the row address to
the SIMMs followed by the assertion of RAS. The SMC then
drives the column address to the same SIMMs. This is followed
by the assertion of CAS and output enable or write
enable provided that the data buses are free to be used. At
this time the SMC sends the MREAD signal or the MWRITE signal
to the data MUX chips to tell them which direction the
data will be traveling. The TACK (transaction acknowledged)
signal is toggled by the SMC to tell the MMC chip to supply
data if this is a write operation or receive data if this is a read
operation. When TACK changes state all of the other SMCs
and the MMC step up their queues because this access will
soon be completed.

Once the MMC chip supplies the write data to the data MUXs 
or receives the data from the data MUXs on a read operation,
it completes the transaction on the Runway bus. The
memory system can have up to eight memory transactions
in progress at one time, and some of them can be overlapped
or performed in parallel.
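
The queue discipline described above (every controller latches every transaction, only the owning SMC acts on it, and TACK advances all queues together) can be modeled with a simplified sketch. This is not the hardware algorithm: it services transactions strictly in order and ignores timing, bank-busy conflicts, and overlap. The class, function, and field names are hypothetical.

```python
from collections import deque


class SlaveController:
    """Toy SMC: latches every transaction, acts only on its own banks."""

    def __init__(self, name, banks):
        self.name = name
        self.banks = frozenset(banks)
        self.queue = deque()

    def latch(self, txn):
        self.queue.append(txn)

    def owns(self, txn):
        _kind, bank = txn
        return bank in self.banks

    def advance(self):
        return self.queue.popleft()


def simulate(smcs, transactions):
    """Broadcast transactions, then service them in order via a TACK step."""
    mmc_queue = deque()
    for txn in transactions:            # MMC drives address + type on the MSI bus
        mmc_queue.append(txn)
        for smc in smcs:                # every SMC latches the same request
            smc.latch(txn)

    serviced = []
    while mmc_queue:
        head = mmc_queue[0]
        owners = [smc for smc in smcs if smc.owns(head)]
        if len(owners) != 1:
            raise ValueError(f"bank {head[1]} must map to exactly one SMC")
        serviced.append((owners[0].name, head))
        mmc_queue.popleft()             # owner toggles TACK: all queues step up
        for smc in smcs:
            if smc.advance() != head:
                raise RuntimeError("queues out of step")
    return serviced


smcs = [SlaveController("SMC0", range(0, 4)), SlaveController("SMC1", range(4, 8))]
result = simulate(smcs, [("read", 5), ("write", 0)])
print(result)
```

Because every queue sees the same requests in the same order, a single TACK toggle is enough to keep the MMC and all SMCs in step without any per-transaction handshake.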

The timing for a single system memory access (idle state) in
a J/K-class system breaks down as follows (measuring from
the beginning of the address cycle on the Runway bus to the
beginning of the first data cycle and measuring in Runway
cycles):

Cycles  Operations during these Cycles

1       Address cycle on Runway bus
2       Address received at MMC to address driven on MSI
        bus (actually 1.5 or 2.5 cycles, each with 50%
        probability)
2       Address on MSI bus
2       SMC address in to RAS driven to DRAMs
6       RAS driven to output enable driven
4       Output enable driven to first data valid at data
        MUX input
4       First data valid at data MUX input to data driven by
        data MUX on MSI bus. With EDO (extended data
        out) DRAMs this time is reduced to 2 cycles.
2       Data on MSI bus
2.5     Data valid at MMC input to data driven on Runway
        bus (includes ECC, synchronization, etc.)

25.5    Total cycles delay from address on Runway bus to
        data on Runway bus for a read operation.
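
As a cross-check of the table, the stage latencies can be summed directly. The sketch below uses illustrative names; the 6-cycle RAS-to-output-enable stage is the value consistent with the stated 25.5-cycle total, and the EDO figure simply applies the 2-cycle reduction mentioned for the data MUX stage.

```python
# (stage description, Runway cycles) for an idle-state read.
READ_STAGES = [
    ("Address cycle on Runway bus", 1),
    ("Address received at MMC to address driven on MSI bus (average)", 2),
    ("Address on MSI bus", 2),
    ("SMC address in to RAS driven to DRAMs", 2),
    ("RAS driven to output enable driven", 6),
    ("Output enable driven to first data valid at data MUX input", 4),
    ("Data MUX input to data driven on MSI bus", 4),
    ("Data on MSI bus", 2),
    ("Data valid at MMC input to data driven on Runway bus", 2.5),
]

EDO_SAVINGS_CYCLES = 2  # data MUX stage drops from 4 to 2 cycles with EDO DRAMs


def read_latency_cycles(edo: bool = False) -> float:
    """Total Runway cycles from address cycle to first data cycle."""
    total = sum(cycles for _description, cycles in READ_STAGES)
    return total - EDO_SAVINGS_CYCLES if edo else total


print(read_latency_cycles(), read_latency_cycles(edo=True))
```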

Register accesses to SMC chips are very similar to memory 
accesses except that the register data values are transferred 
on the MSI address bus instead of the MSI data bus. 



Board Design Challenges 

Early in the board design it became clear that because of the 
number of SIMMs and the physical space allocated to the
memory carrier board, the design would not work without
some clever board layout and VLSI pinout changes. After
several different board configurations (physical shape,
SIMM placements, through-hole or surface mount SIMM
connectors, SMCs and data MUX placements) were evaluated,
the final configuration of 16 SIMMs, four data multiplexers,
and four SMCs on each of two boards was chosen.

Given the very tight component spacing required with this 
configuration, the pinouts of the data MUX and SMC chips
had to be chosen carefully. The pinout of the data MUX chip
was chosen so that the RAM data buses from the SIMMs and
the MSI data bus to the connector were "river routable" (no
trace crossing required). The pinout of the SMC chip was
chosen with the layout of the SIMMs in mind. It also had to
be possible to mount both chips on the backside of the
board and still meet the routing requirements. Being able to
choose chip pinouts to suit board layout requirements is one
of the many advantages of in-house custom chip designs.
Without this ability it is doubtful that this memory carrier
board configuration would be possible. This is another 
example of closely coupled system design. 

One of the goals for this design was to have a board that
could be customer shippable on first release (no functional
or electrical bugs). To meet this goal a lot of effort was
placed on simulating the operating environment of the memory
subsystem. By doing these simulations, both SPICE and
functional (Verilog, etc.), electrical and functional problems
were found and solved before board and chip designs were
released to be built.

For example, major electrical cross talk problems were 
avoided through the use of SPICE simulations. In one case, 
four 72-bit buses ran the length of the board (about 10.5 
inches) in parallel. Each trace was 0.005 inch wide and the 
traces were spaced 0.005 inch apart on the same layer (standard
PCB design rules) with only 0.0048 inch of vertical separation
between layers. Five-wire mutually coupled memory
carrier board and SIMM board models for SPICE were
created using HP VLSI design tools. When this model set
was simulated, the electrical cross talk was shown to be
greater than 60% and would have required a major board
redesign to fix if found after board release. The solution
was to use a 0.007-inch minimum spacing between certain
wires and to use a nonstandard board layer stack construction
that places a ground plane between each pair of signal
layers.

The memory carrier board uses several unusual technologies.
For example, the board construction (see Fig. 5) is designed
to reduce interlayer cross talk between the RD_A and
RD_B data buses. As a result of this board layering, the characteristic
impedance of nominal traces is about 38 ohms for
both the inner and outer signal layers. The nominal trace
width for both inner and outer signal layers is 0.005 inch
with 0.005-inch spacing on the outer layers and 0.007-inch
spacing on the inner layers to reduce coupling between long
parallel data signals.

The early SPICE and functional simulation effort paid off.
No electrical or functional bugs were found on the memory
carrier board, allowing the R&D revision 1 board design to
be released as manufacturing release revision A.

Another new technology used on the memory carrier board 
is the BERG Micropax connector system. The Micropax con- 
nector system was selected because of its controlled imped- 
ance for signal interconnect and the large number of con- 
nections per inch. However, for these very same reasons the 
connector system requires extremely tight tolerances when 
machining the edge of the board containing the connectors. 

A new manufacturing process used by the memory carrier 
board, the HP 2RSMT board assembly process, was devel- 
oped to allow the surface mounting of extra-fine-pitch VLSI 
components on both sides of the board along with the 
through-hole SIMM connectors.


Hardware Cache Coherent Input/Output



Hardware cache coherent I/O is a new feature of the PA-RISC architecture 
that involves the I/O hardware in ensuring cache coherence, thereby 
reducing CPU and memory overhead and increasing performance. 

by Todd J. Kjos, Helen Nusbaum, Michael K. Traynor, and Brendan A. Voge 



A new feature, called hardware cache coherent I/O, was
introduced into the HP PA-RISC architecture as part of the 
HP 9000 J/K-class program. This feature allows the I/O hard- 
ware to participate in the system-defined cache coherency 
scheme, thereby relieving the memory system and processors
of unnecessary overhead and contributing to greater
system performance. This paper reviews I/O data transfer, 
introduces the concept of cache coherent I/O from a hard- 
ware perspective, discusses the implications for HP-UX* 
software, illustrates some of the benefits realized by HP's 
networking products, and presents measured performance 
results. 

I/O Data Transfer 

To understand the impact of the HP 9000 J/K-class coherent 
I/O implementation, it is necessary to take a step back and 
get a high-level view of how data is transferred between I/O 
devices and main memory on HP-UX systems. 

There are two basic models for data transfer: direct memory
access (DMA) and programmed I/O (PIO). The difference
between the two is that a DMA transfer takes place without
assistance from the host processor while PIO requires the
host processor to move the data by reading and writing registers
on the I/O device. DMA is typically used for devices
like disks and LANs which move large amounts of data and 
for which performance is important. PIO is typically used 
for low-cost devices for which performance is less impor- 
tant, like RS-232 ports. PIO is also used for some high-per- 
formance devices like graphics frame buffers if the program- 
ming model requires it. 

All data transfers move data either to main memory from an 
I/O device (inbound) or from main memory to an I/O device 
(outbound). These transfers require one or more transac- 
tions on each bus between the I/O device and main memory. 
Fig. 1 shows a typical PA-RISC system with a two-level bus 
hierarchy. PA-RISC processor-to-memory buses typically 
support transactions in sizes that are powers of 2, up to 32 
bytes, that is, READ4, WRITE4, READ8, WRITE8, READ16, WRITE16,
READ32, WRITE32, where the number refers to the number of
bytes in the transaction. Each transaction has a master and 
a slave: the master initiates the transaction and the slave 
must respond. Write transactions move data from the master 
to the slave, and read transactions cause the slave to re- 
spond with data for the master. The processor is always the 
master for PIO transactions to the I/O device. An I/O device 
is always the master for a DMA transaction. For example, if
a software device driver is reading (PIO) a 32-bit register on
the fast/wide SCSI device shown in Fig. 1, it causes the processor
to master a READ4 transaction to the device, which
results in the I/O adapter mastering a READ4 transaction on
the I/O bus, where the fast/wide SCSI device responds with
the four bytes of data. If the Fibre Channel interface card is
programmed to DMA transfer 4K bytes of data from memory
to the disk, it will master 128 READ32 transactions to get the
data from memory. The bridge forwards transactions in both
directions as appropriate.
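
The transaction count in the DMA example is simply the transfer size divided by the transaction size. A small sketch, assuming the buffer is aligned so that every transaction is a full READ32 (the function name is illustrative):

```python
def dma_transaction_count(transfer_bytes: int, transaction_bytes: int = 32) -> int:
    """Number of bus transactions needed to move transfer_bytes (rounded up)."""
    if transfer_bytes < 0 or transaction_bytes <= 0:
        raise ValueError("sizes must be non-negative and non-zero respectively")
    return -(-transfer_bytes // transaction_bytes)   # ceiling division


read32_count = dma_transaction_count(4 * 1024)   # 4K-byte outbound DMA
print(read32_count)
```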

Because PIO transactions are not in memory address space 
and are therefore not a coherency concern, the rest of this 
article discusses DMA transactions only. The coherent I/O 
hardware has no impact at all on I/O software device drivers 
that interact with devices via PIO exclusively. 

Hardware Implications 

Cache memory is defined as a small, high-speed block of 
memory located close to the processor. On the HP PA 7200 
and PA 8000 processors, a portion of the software virtual 
address (called the virtual index) is used as the cache 
lookup. Main memory is much larger and slower than cache. 
It is accessed using physical addresses, so a virtual-to-physi- 
cal address translation must occur before issuing any re- 
quest to memory. Entries in the PA 7200 and PA 8000 caches 
are stored in lines of 32 bytes. Since data that is referenced
once by the processor is likely to be referenced again and
nearby data is likely to be accessed as well, the line size is
selected to optimize the frequency with which it is accessed
while minimizing the overhead associated with obtaining the
data from main memory. The cache contains the most recently
accessed lines, thereby maximizing the rate at which
processor-to-memory requests are intercepted, ultimately
reducing latency.

When a processor requests data (by doing loads), the line
containing the data is copied from main memory into the
cache. When a processor modifies data (by doing stores),
the copy in the cache will become more up-to-date than the
copy in memory. HP 9000 J/K-class products resolve this
stale data problem by using the snoopy cache coherency
scheme defined in the Runway bus protocol. Each processor
monitors all Runway transactions to determine whether the
virtual index requested matches a line currently stored in its
cache. This is called "snooping the bus." A Runway processor
must own a cache line exclusively or privately before it
can complete a store. Once the store is complete, the cache
line is considered dirty relative to the stale memory copy.
To maximize Runway bus efficiency, processors are not required
to write this dirty data back to memory immediately.
Instead, the write-back operation occurs when the cache
line location is required for use by the owning processor for
another memory access. If, following the store but before
the write-back, another processor issues a read of this cache
line, the owning processor will snoop this read request and
respond with a cache-to-cache copy of the updated cache
line data. This data is then stored in the requesting processor's
cache and main memory.

Since the I/O system must also read (output) and modify
(input) memory data via DMA transactions, data consistency
for the I/O system must be ensured as well. For example,
if the I/O system is reading data from memory (for outbound
DMA) that is currently dirty in a processor's cache, it
must be prevented from obtaining a stale, out-of-date copy
from memory. Likewise, if the I/O system is writing data to
memory (for inbound DMA), it must ensure that the processor
caches acquire this update. The optimum solution not
only maintains consistency by performing necessary input/ 
output operations while preventing the transfer of any stale 
copies of data, but also minimizes any interference with 
CPU cycles, which relate directly to performance. 

Cache coherence refers to this consistency of memory objects
between processors, memory modules, and I/O devices.
HP 9000 systems without coherent I/O hardware must
rely on software to maintain cache coherency. At the hardware
level, the I/O device's view of memory is different from
the processor's because requested data might reside in a
processor's cache. Typically, processor caches are virtually
indexed while I/O devices use physical addresses to access 
memory. Hence there is no way for I/O devices to participate 
in the processor's coherency protocol without additional 
hardware support in the I/O adapter. 

Some architectures have prevented stale data problems by
implementing physically indexed caches, so that it is the
physical index, not the virtual index, that is snooped on the
bus. Thus, the I/O system is not required to perform a physical-to-virtual
address translation to participate in the snoopy
coherence protocol. On the HP 9000 J/K-class products, we
chose to implement virtually indexed caches, since this
minimizes cache lookup time by eliminating a virtual-to-physical
address translation before the cache access.

Other architectures have avoided the output stale data prob- 
lem by implementing write-through caches in which all pro- 
cessor stores are immediately updated in both the cache and 
the memory. The problem with this approach is its high use
of processor-to-memory bus bandwidth. Likewise, to resolve
the input stale data problem, many architectures allow
cache lines to be marked as uncacheable, meaning that they
can never reside in cache, so main memory will always have
correct data. The problem with this approach is that input 
data must first be stored into this uncacheable area and then 
copied into a cacheable area for the processor to use it. This 
process of copying the data again consumes processor-to- 
memory cycles for nonuseful work. 

Previous implementations of HP's PA-RISC processors cir- 
cumvent these problems by making caches visible to soft- 
ware. On outbound DMA, the software I/O device drivers 
execute flush data cache instructions immediately before 
output operations. These instructions are broadcast to all 
processors and require them to flush their caches by writing 
the specified dirty cache lines back to main memory. After 
the DMA buffer has been flushed to main memory, the outbound
operation can proceed and is guaranteed to have the
most up-to-date data. On inbound DMA, the software I/O
device drivers execute broadcast purge data cache instructions
just before the input operations to remove the DMA
buffer from all processor caches in the machine. The PA-RISC
architecture's flush data cache and purge data cache
instruction overhead is small compared to the performance 
impact incurred by these other schemes, and the I/O hard- 
ware remains ignorant of the complexities associated with 
coherency. 

I/O Adapter Requirements. The HP 9000 J/K-class products
and the generation of processors they are designed to support
place greater demands on the I/O hardware system,
ultimately requiring the implementation of cache coherent
I/O hardware in the I/O bus adapter, which is the bus converter
between the Runway processor-memory bus and the
HP-HSC I/O bus. The first of these demands for I/O hardware
cache coherence came from the design of the PA 7200
and PA 8000 processors and their respective implementations
of cache prefetching and speculative execution. The
introduction of these features would have required software
I/O device drivers to purge inbound buffers twice, once before
the DMA and once after the DMA completion, thus
doubling the performance penalty. Because aggressive prefetches
into the DMA region could have accessed stale data
after the purge but before the DMA, the second purge would
have been necessary after DMA completion to cleanse stale
data prefetch buffers in the processor. By designing address
translation capabilities into the I/O adapter, we enable it to
participate in the Runway snoopy protocol. By generating
virtual indexes, the I/O adapter enables processors to compare
and detect collisions with current cache addresses and
to purge prefetched data aggressively before it becomes
stale.

Another demand on the I/O adapter pushing it in the direction
of I/O cache coherence came from the HP 9000 J/K-class
memory controller. It was decided that the new memory
controller would not implement writes of less than four
words. (These types of writes would have required read-modify-write
operations in the DRAM array, which have long
cycle times and, if executed frequently, degrade overall main
memory performance.) Because one-word writes occur in
the I/O system, for registers, semaphores, or short DMA
writes, it was necessary that the I/O adapter implement a
one-line-deep cache to buffer cache lines, so that these one-word
writes could be executed by performing a coherent
read private transaction on the Runway bus, obtaining the
most recent copy of the cache line, modifying it locally in
cache, and finally writing the modified line back to main
memory. For the I/O adapter to support a cache on the Runway
bus, it has to have the ability to compare processor-generated
virtual address transactions with the address contained
in its cache to ensure that the processors always
receive the most up-to-date data.

To simplify the design, the I/O adapter implements a subset
of the Runway bus coherency algorithm. If a processor requests
a cache line currently held privately by the I/O
adapter, the I/O adapter stalls the coherency response, finishes
the read-modify-write sequence, writes the cache line
back to memory, and then responds with COH_OK, meaning
that the I/O adapter does not have a copy of this cache line.
This was much simpler to implement than the processor
algorithm, which on a conflict responds with COH_CPY, meaning
that the processor has this cache line and will issue a
cache-to-cache copy after modifying the cache line. Since
the I/O adapter only has a one-line cache and short DMA,
register, and semaphore writes are infrequent, it was felt that
the simpler algorithm would not be a performance issue.
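
The simplified coherency behavior of the I/O adapter can be sketched as a toy model. This is an illustration under several assumptions, not the hardware implementation: main memory is a dictionary of 8-word (32-byte) lines, the adapter keeps its private line until it is snooped or needs the slot for another line, and the class and method names are hypothetical. Only the COH_OK response and the read-private, modify, write-back sequence come from the text.

```python
LINE_WORDS = 8  # a 32-byte cache line holds eight 4-byte words


class IOAdapterLineCache:
    """Toy one-line cache that performs sub-line writes by read-modify-write."""

    def __init__(self, memory):
        self.memory = memory        # dict: line address -> list of 8 words
        self.line_addr = None       # address of the privately held line, if any
        self.line = None

    def write_word(self, line_addr, word_index, value):
        """One-word write: fetch the line privately, then modify it locally."""
        if not 0 <= word_index < LINE_WORDS:
            raise IndexError("word index outside the cache line")
        if self.line_addr != line_addr:
            self.write_back()                        # free the single line slot
            self.line = list(self.memory[line_addr])  # models coherent read private
            self.line_addr = line_addr
        self.line[word_index] = value

    def write_back(self):
        """Write the held line (if any) to memory and give up ownership."""
        if self.line_addr is not None:
            self.memory[self.line_addr] = list(self.line)
            self.line_addr = None
            self.line = None

    def snoop(self, line_addr):
        """Respond to a processor request; a held line is written back first."""
        if self.line_addr == line_addr:
            self.write_back()
        return "COH_OK"    # adapter never reports holding a copy


memory = {0x100: [0] * LINE_WORDS}
adapter = IOAdapterLineCache(memory)
adapter.write_word(0x100, 3, 0xAB)
before_snoop = memory[0x100][3]      # memory is still stale at this point
response = adapter.snoop(0x100)      # forces the write-back, then answers
after_snoop = memory[0x100][3]
print(before_snoop, response, after_snoop)
```

The point of the design choice is visible in `snoop`: the adapter never has to supply a cache-to-cache copy, because by the time it answers, memory already holds the current data.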

A final requirement for the I/O adapter is that it handle 32-bit 
I/O virtual addresses on the HP-HSC bus and a larger 40-bit 
physical address on the Runway bus to support the new 
processors. Thus a mapping function is required to convert 
all HP-HSC DMA transactions. This is done via a lookup 
table in memory, set up by the operating system, called the 
I/O page directory. With minor additional effort, I/O page 
directory entries were defined to provide the I/O adapter 
with not only the 40-bit physical address, but also the soft- 
ware virtual index. This provides all the information neces- 
sary for the I/O adapter to be a coherent client on the Run- 
way bus. The I/O adapter team exploited the mapping 
process by implementing an on-chip TLB (translation looka- 
side buffer), which caches the most recent translations, to 
speed the address conversion, and by storing additional 
attribute bits in each page directory entry to provide clues 
about the DMA's page destination, thereby allowing further 
optimization for each HP-HSC-to-Runway transaction. 

I/O TLB Access. The mechanism selected for accessing the I/O 
TLB both minimizes the potential for thrashing and is flexible 
enough to work with both large and small I/O systems. 
(Thrashing occurs when two DMA streams use the same 
TLB RAM location, each DMA transaction alternately casting 
the other out of the TLB, resulting in tremendous overhead 
and lower performance.) Ideally, each DMA stream 
should use a different TLB RAM location, so that only one 
TLB miss read is done per page of DMA. 

We implemented a scheme by which the upper 20 bits of the 
I/O virtual address are available to be divided into a chain ID 
and a block ID (Fig. 2). The lower 12 bits of the address 
must be left alone because of the 4K-byte page size defined 
by the architecture. The upper address bits (chain ID) of the 
I/O virtual address are used to access the TLB RAM, and the 
remainder of the I/O virtual address (block ID) is used to 
verify a TLB hit (as a tag). This allows software to assign 
each DMA stream to a different chain ID and each 4K-byte 
block of the DMA to a different block ID, thus minimizing 
thrashing between DMA streams. 

Fig. 2. TLB translation scheme. The upper address bits (chain ID) 
of the I/O virtual address are used to access the TLB RAM, and the 
remainder of the I/O virtual address (block ID) is used to verify a 
TLB hit (as a tag). 
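
The field split described above can be sketched in C as follows. This is a minimal illustration, not the adapter's actual logic: the names iova_fields and iova_split, and the way the field widths are passed in, are assumptions made for the example. It assumes a 32-bit I/O virtual address with a 12-bit page offset, and that any address bits above the chain ID are zero.

```c
#include <assert.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12u   /* 4K-byte pages: low 12 bits are the page offset */

/* Fields of a 32-bit I/O virtual address (illustrative names). */
struct iova_fields {
    uint32_t chain_id;  /* upper bits: index into the TLB RAM */
    uint32_t block_id;  /* remaining page-number bits: TLB tag */
    uint32_t offset;    /* byte offset within the 4K-byte page */
};

/* Split iova using programmable field widths.  chain_bits + block_bits is
 * the total number of translated bits; bits above that are masked off. */
static struct iova_fields iova_split(uint32_t iova, unsigned chain_bits,
                                     unsigned block_bits)
{
    struct iova_fields f;
    unsigned k = chain_bits + block_bits;
    uint32_t page;

    assert(chain_bits >= 1 && k <= 20);

    page = (iova >> IO_PAGE_SHIFT) & ((1u << k) - 1u);
    f.chain_id = page >> block_bits;
    f.block_id = page & ((1u << block_bits) - 1u);
    f.offset   = iova & ((1u << IO_PAGE_SHIFT) - 1u);
    return f;
}
```

With 8 chain ID bits and 12 block ID bits, for example, the address 0x12345678 splits into chain ID 0x12, block ID 0x345, and page offset 0x678.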

A second feature of this scheme is that it helps limit the 
overhead of the I/O page directory. Recall that the I/O page 
directory contains all active address translations and must 
be memory-resident. I/O page directory size is equal to the 
size of one entry times 2^k, where k is the number of chain ID 
bits plus the number of block ID bits. The division between 
the chain ID and the block ID is programmable, as is the 
total number of bits (k), so software can reduce the memory 
overhead of the I/O page directory for systems with smaller 
I/O subsystems if we guarantee that the leading address bits 
are zero for these smaller systems. 
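
The sizing rule can be made concrete with a small calculation. The sketch below is only an illustration: the function name io_pdir_bytes is hypothetical, and the entry size is left as a parameter because the text does not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of memory-resident I/O page directory for k translated address
 * bits, where k = chain ID bits + block ID bits.  entry_bytes is supplied
 * by the caller; the entry size is an assumption, not a documented value. */
static size_t io_pdir_bytes(unsigned chain_bits, unsigned block_bits,
                            size_t entry_bytes)
{
    unsigned k = chain_bits + block_bits;

    assert(k <= 20);            /* only the upper 20 address bits are translated */
    return entry_bytes << k;    /* entry size times 2^k */
}
```

Assuming a hypothetical 8-byte entry, using all 20 bits would require 8M bytes of page directory, while a small system configured with k = 12 would need only 32K bytes.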

If the translation is not currently loaded in the I/O adapter's 
I/O TLB, the I/O adapter reads the translation data from the 
I/O page directory and then proceeds with the DMA. Servicing 
the TLB miss does not require processor intervention, 
although the I/O page directory entry must be valid before 
initiating the DMA. 
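
The effect of chain ID assignment on thrashing can be seen in a toy software model of a TLB that is indexed by chain ID and tagged by block ID. This is a sketch for illustration only; the structure and function names are invented here, and the model ignores everything except hit and miss counting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 256u          /* one entry per chain ID */

struct io_tlb {
    uint8_t  valid[TLB_ENTRIES];
    uint32_t tag[TLB_ENTRIES];    /* block ID of the cached translation */
    unsigned long misses;         /* number of I/O page directory reads */
};

static void io_tlb_init(struct io_tlb *t)
{
    memset(t, 0, sizeof *t);
}

/* Model one DMA transaction.  On a miss, the entry is refilled as if the
 * adapter had read the I/O page directory without processor involvement.
 * Returns 1 on a hit and 0 on a miss. */
static int io_tlb_access(struct io_tlb *t, uint32_t chain_id, uint32_t block_id)
{
    assert(chain_id < TLB_ENTRIES);
    if (t->valid[chain_id] && t->tag[chain_id] == block_id)
        return 1;
    t->valid[chain_id] = 1;
    t->tag[chain_id] = block_id;
    t->misses++;
    return 0;
}
```

In this model, two streams that interleave transactions on the same chain ID with different block IDs miss on every access, whereas two streams on different chain IDs miss only once each per page.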

Attribute Bits. Each mapping of a page of memory has attribute 
bits (or clues) that allow some control over how the 
DMA is performed. The page attribute bits control the en- 
abling and disabling of prefetch for reads, the enabling and 
disabling of atomic mode, and the selection of fast DMA or 
safe DMA. 

Prefetch enable allows the I/O adapter to prefetch on DMA 
reads, thus improving outbound DMA performance. Because 
the I/O adapter does not maintain coherency on prefetch 
data, software must enable prefetching only on pages where 
there will be no conflicts with processor-held cache data. 
Prefetch enable has no effect on inbound DMA writes. 

Atomic or locked mode allows a DMA transfer to own all of 
memory. While an atomic mode transfer is in progress, processors 
cannot access main memory. This feature was 
added to support PC buses that allow locking (ISA, EISA). 
The HP-HSC bus also supports this functionality. In almost 
all cases, atomic mode is disabled, because it has tremendous 
performance effects on the rest of the system. 

The fast/safe bit only has an effect on half-cache-line DMA 
writes. However, many I/O devices issue this type of 16-byte 
write transaction. In safe mode, the write is done as a read- 
modify-write transaction in the I/O adapter cache, which is 
relatively low in performance. In fast mode, the write is 
issued as a WRITE16_PURGE transaction, which is interpreted 
by the processors as a purge cache line transaction and a 
write half cache line transaction to memory. The fast/safe 
DMA attribute is used in the following way. In the middle of 
a long inbound DMA, fast mode is used: the processor's 
caches are purged while DMA data is moved into memory. 
This is acceptable because the processor should not be 
modifying any of these cache lines, since the DMA data would 
overwrite the cache data anyway. However, at the beginning or 
end of a DMA transfer, where the cache line might be split 
between the DMA sequence and some other data unrelated 
to the DMA sequence, the DMA transaction needs to preserve 
this other data, which might be held private-dirty by a 
processor. In such cases, the safe mode is used. This feature 
allows the vast majority of 16-byte DMA writes to be done as 
WRITE16_PURGEs, which have much better performance than 
read-modify-writes internal to the I/O adapter cache. WRITE16_PURGE is 
the only half-cache-line transaction the memory subsystem 
supports. All other memory transactions operate on full 
cache lines. 
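
The condition under which safe mode is needed reduces to an alignment check on the ends of the buffer. The sketch below assumes a 32-byte cache line (twice the 16-byte half-cache-line write mentioned above); the function name is hypothetical, and the real attribute is set per page rather than per transfer.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed: a half cache line is 16 bytes */

/* Returns nonzero if the cache line holding the first or last byte of the
 * buffer also holds bytes outside the buffer.  In that case a write-purge
 * could destroy unrelated data held dirty in a processor cache, so the
 * ends of the transfer call for safe mode. */
static int dma_shares_cache_line(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    return (addr % CACHE_LINE) != 0 || ((addr + len) % CACHE_LINE) != 0;
}
```

A page-aligned 4K-byte buffer can therefore use fast mode throughout, while a buffer that starts 16 bytes into a cache line, or that is only 16 bytes long, cannot.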

HP-UX Implementation 

Cache coherent I/O affects HP-UX I/O device drivers. 
Although the details differ for each software 
I/O device driver that sets up DMA transactions, the 
basic algorithm is the same. For outbound DMA on a system 
without coherent I/O hardware, the driver must perform the 
following tasks: 

• Flush the data caches to make sure memory is consistent 
with the processor caches. 

• Convert the processor virtual address to a physical address. 

• Modify and flush any control structures shared between the 
driver and the I/O device. 

• Initiate the DMA transfer by programming the device to 
move the data from the given physical address, using a 
device-specific mechanism. 

For inbound DMA the algorithm is similar: 

• Purge the data cache to prevent stale cache data from being 
written over DMA data. 

• Convert the processor virtual address to a physical address. 

• Modify and flush any control structures shared between the 
driver and the I/O device. 

• Initiate the DMA transfer by programming the device to 
move the data to the given physical address, using a 
device-specific algorithm. 



When the DMA completes, the device notifies the host pro- 
cessor via an interrupt, the driver is invoked to perform any 
post-DMA cleanup, and high-level software is notified that 
the transfer has completed. For inbound DMA the cleanup 
may include purging the data buffer again in case the pro- 
cessor has prefetched the old data. 

To support coherent I/O hardware, changes to this basic 
algorithm could not be avoided. Since coherent I/O hard- 
ware translates 32-bit I/O virtual addresses into processor 
physical addresses that may be wider than 32 bits, I/O devices 
must be programmed for DMA to I/O virtual addresses 
instead of physical addresses. Also, since coherency is maintained 
by the coherent I/O adapter, no cache flushes or 
purges are necessary, and they should be avoided for performance 
reasons. To allow drivers to function properly regardless of 
whether the system has coherent I/O hardware, HP-UX services 
were defined to handle the differences transparently. 
There are three main services of interest to drivers: map() is 
used to convert a virtual address range into one or more I/O 
virtual addresses, unmap() is used to release the resources 
obtained via map() once the transfer is complete, and 
dma_sync() is used to synchronize the processor caches with 
the DMA buffers, replacing explicit cache flush and purge 
services. These services are discussed in more detail below. 

Drivers had to be modified where one of the following 
assumptions existed: 

• Devices use processor physical addresses to access memory. 
This assumption is still true for noncoherent systems, but 
on HP 9000 J/K-class systems I/O virtual addresses must be 
used. The map() service transparently returns a physical 
address on noncoherent systems and an I/O virtual address 
on coherent systems. 

• Cache management must be performed by software. This is 
still true for noncoherent systems, but on coherent systems 
flushes and purges should be avoided for performance reasons. 
The dma_sync() service performs the appropriate cache 
synchronization functions on noncoherent systems but does 
not flush or purge on coherent systems. 

• The driver does not have to keep track of any DMA resources. 
Drivers now have to remember what I/O virtual 
addresses were allocated to them so they can call unmap() 
when the DMA transfer is complete. 

To accommodate these necessary modifications, the soft- 
ware model for DMA setup has been changed to: 

• Synchronize the caches using dma_sync(). 

• Convert the processor virtual address to an I/O virtual 
address using the map() service. 

• Initiate the DMA transfer via a device-specific mechanism. 

• Call the unmap() service to release DMA resources when the 
DMA transfer is complete. 

On noncoherent systems, this has the same effect as before, 
except that the driver doesn't know whether or not the 
cache was actually flushed or whether the I/O virtual address 
is a physical address. 

For drivers that rely entirely on existing driver services to 
set up DMA buffers (like most EISA drivers), no changes 
were needed since the underlying services were modified to 
support coherent I/O. 
Driver Services: map and unmap. The map() service and its variants 
are the only way to obtain an I/O virtual address for a 
given memory object. Drivers cannot assume that a buffer 
can be mapped in a single call to map() unless the buffer is 
aligned on a cache line boundary and does not cross any 
page boundaries. Since it is possible that multiple I/O virtual 
addresses are needed to map a DMA buffer completely, map() 
should be called within a loop as shown in the following C 
code fragment: 

/* Function to map an outbound DMA buffer
 * Parameters:
 *   isc: driver control structure
 *   space_id: the space id for the buffer
 *   virt_addr: the virtual offset for the buffer
 *   buffer_length: the length (in bytes) of the buffer
 *   iovecs (output): an array of address/length I/O
 *       virtual address pairs
 * Output:
 *   Returns the number of address/length I/O virtual
 *   address pairs (-1 if error)
 */
int my_buffer_mapper(isc, space_id, virt_addr, buffer_length, iovecs)
struct isc_table_type *isc;
int space_id;        /* types of space_id, virt_addr, and */
caddr_t virt_addr;   /* buffer_length are assumed here    */
int buffer_length;
struct iovec *iovecs;
{
    int vec_cnt = 0;
    struct iovec hostvec;
    int retval;

    /* Flush cache (on noncoherent systems) */
    dma_sync(space_id, virt_addr, buffer_length, IO_SYNC_FORCPU);

    /* Set up input for map() */
    hostvec.iov_base = virt_addr;
    hostvec.iov_len = buffer_length;

    do {
        /* Map the buffer */
        retval = wsio_map(isc, NULL, 0, space_id, virt_addr,
                          &hostvec, iovecs);
        if (retval >= 0) {
            /* Mapping was successful; point to the
             * next I/O virtual address vector. Note:
             * hostvec was modified by map() to point to
             * the unmapped portion of the buffer.
             */
            vec_cnt++;
            iovecs++;
        }
    } while (retval > 0);

    /* Check for any errors; release whatever was already mapped */
    if (retval < 0) {
        while (vec_cnt) {
            iovecs--;   /* step back to the last mapped vector */
            wsio_unmap(isc, iovecs->iov_base);
            vec_cnt--;
        }
        vec_cnt = -1;
    }

    return (vec_cnt);
}

In this case the map() variant wsio_map() is used to map a 
buffer. When a mapping is successful, the driver can expect 
that the virtual host range structure has been modified to 
point to the unmapped portion of the DMA buffer. The 
wsio_map() service just converts the isc parameter into the 
appropriate token expected by map() to find the control 
structure for the correct I/O page directory. The calling convention 
for map() is: 

map(token,map_cb,hints,range_type,host_range,io_range), 

where token is an opaque value that allows map() to find the 
bookkeeping structures for the correct I/O page directory. 
The map_cb parameter is an optional parameter that allows 
map() to store some state information across invocations. It 
is used to optimize the default I/O virtual address allocation 
scheme (see below). The hints parameter allows drivers to 
specify page attributes to be set in the I/O page directory or 
to inhibit special handling of unaligned DMA buffers. The 
host_range contains the virtual address and length of the 
buffer to be mapped. As a side effect, the host_range is modified 
by map() to point to the unmapped portion of the buffer 
(or the end of the buffer if the entire range was mapped). 
The io_range is set up by map() to indicate the I/O virtual 
address and the length of the buffer that was just mapped. 
The range_type is usually the space ID for the virtual address, 
but may indicate that the buffer is a physical address. 

All I/O virtual addresses allocated via map() must be deallocated 
via unmap() when they are no longer needed, either 
because there was an error or because the DMA completed. 
The calling convention for unmap() is: 

unmap(token,io_range). 

The map() service uses the following algorithm to map memory 
objects into I/O virtual space: 

• Allocate an I/O virtual address for the mapping. 

• Initialize the I/O page directory entry with the appropriate 
page attributes. The page directory entry will be brought 
into the I/O translation lookaside buffer when there is a 
miss. 

• Update the caller's range structures and return the number 
of bytes left to map. 

The first two steps are discussed separately below. 

I/O Virtual Address Allocation Policies. As mentioned above, if 
several DMA streams share a chain ID, there is a risk that 
performance will suffer significantly because of thrashing. 
Two allocation schemes that appeared to eliminate thrashing 
are: 

• Allocate a unique chain ID to every DMA stream. 

• Allocate a unique chain ID to every HP-HSC guest. 

In the I/O adapter there are a total of 256 I/O translation 
lookaside buffer entries, and therefore there are 256 chain 
IDs to allocate. The first allocation scheme is unrealistic 
because many more than 256 DMA streams can be active 
from the operating system's point of view. For instance, a 
single networking card can have over 100 buffers available 
for inbound data at a given time, so with only three network- 
ing cards the entire I/O TLB would be allocated. Thrashing 
isn't really a problem unless individual transactions are interleaved 
with the same chain ID, so on the surface it may 
appear that the second allocation scheme would do the trick 
(since most devices interleave different DMA streams at fairly 
coarse granularity like 512 or 1K bytes). Unfortunately, the 
second scheme has a problem in that some devices (like SCSI 
adapters) can have many large DMA buffers, so all current 
outstanding DMA streams cannot be mapped into a single 
chain ID. One of the goals of the design was to minimize the 
impact on drivers, and many drivers had been designed with 
the assumption that there were no resource allocation problems 
associated with setting up a DMA buffer. Therefore, it 
was unacceptable to fail a mapping request because the 
driver's chain ID contained no more free pages. The bookkeeping 
involved in managing the fine details of the individual 
pages and handling overflow cases while guaranteeing 
that mapping requests would not fail caused us to seek a 
solution that would minimize (rather than eliminate) the 
potential for thrashing, while also minimizing the bookkeep- 
ing and overhead of managing the chain ID resource. 

What we finally came up with was two allocation schemes: a 
default I/O virtual address allocator which is well-suited to 
mass storage workloads (disk, tape, etc.) and an alternate 
allocation scheme for networking-like workloads. It was 
observed early on that there are some basic differences in 
how some devices behave with regard to DMA buffer management. 
Networking drivers tend to have many buffers 
allocated for inbound DMA, but devices tend to access them 
sequentially. Therefore, networking devices fit the model 
very well for the second allocation scheme listed at the be- 
ginning of this section, except that it is likely that multiple 
chain IDs will be necessary for each device because of the 
number of buffers that must be mapped at a given time. 
Mass storage devices, however, may have many DMA buff- 
ers posted to the device, and no assumptions can be made 
about the order in which the buffers will be used. This be- 
havior was dubbed nonsequential. It would have resulted in 
excessive overhead in managing the individual pages of a 
given chain ID if the second scheme listed at the beginning 
of this section had been implemented. To satisfy the require- 
ments for both sequential and nonsequential devices, it was 
decided to manage virtual chain IDs called ranges instead 
of chain IDs. This allows the operating system to manage 
the resource independent of the physical size of the I/O 
translation lookaside buffer. Thrashing is minimized by always 
allocating free ranges in order, so that thrashing cannot 
occur unless 256 ranges are already allocated. Therefore, 
software has a slightly different view of the I/O virtual 
address than the hardware, as shown in Fig. 3. 

With this definition of an I/O virtual address, software is not 
restricted to 256 resources, but instead can configure the 
number of resources by adjusting the number of ranges per 
chain ID. For the HP-UX 10.0 implementation, there are 
eight pages per range so that up to 32K bytes can be mapped 
with a single range. The main allocator keeps a bitmap of all 
ranges, but does not manage individual pages. 

A mass storage driver will have one of these ranges allocated 
whenever it maps a 32K-byte or smaller buffer (assuming 
the buffer is aligned on a page). For large buffers (> 32K 
bytes), several ranges will be allocated. When the DMA 
transfer is complete, the driver unmaps the buffer and the 
associated ranges are returned to the pool of free resources. 
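
A bitmap allocator of this kind might look like the following sketch. It is not the HP-UX code: the pool size NUM_RANGES and all function names are hypothetical, and taking the lowest-numbered free range first is just one possible reading of "allocating free ranges in order."

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES       4096u
#define PAGES_PER_RANGE  8u      /* HP-UX 10.0 value given in the text */
#define NUM_RANGES       1024u   /* hypothetical pool size */

static uint32_t range_bitmap[NUM_RANGES / 32];   /* bit set = range in use */

/* Ranges needed to map len bytes starting at virtual address addr. */
static unsigned ranges_needed(uintptr_t addr, size_t len)
{
    size_t span  = (size_t)(addr % PAGE_BYTES) + len;  /* from first page start */
    size_t pages = (span + PAGE_BYTES - 1) / PAGE_BYTES;

    return (unsigned)((pages + PAGES_PER_RANGE - 1) / PAGES_PER_RANGE);
}

/* Allocate the lowest-numbered free range; returns its index or -1. */
static int range_alloc(void)
{
    unsigned r;

    for (r = 0; r < NUM_RANGES; r++) {
        if (!(range_bitmap[r / 32] & (1u << (r % 32)))) {
            range_bitmap[r / 32] |= 1u << (r % 32);
            return (int)r;
        }
    }
    return -1;
}

/* Return a range to the free pool, as unmap does in the default scheme. */
static void range_free(int r)
{
    assert(r >= 0 && (unsigned)r < NUM_RANGES);
    range_bitmap[r / 32] &= ~(1u << (r % 32));
}
```

Because only whole ranges are tracked, a page-aligned 32K-byte buffer takes one range, while the same buffer shifted by 16 bytes spans nine pages and takes two.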

All drivers use the default allocator unless they notify the 
system that an alternate allocator is needed. A driver service 
called set_attributes allows the driver to specify that it controls 
a sequential device and therefore should use a different 
allocation scheme. In the sequential allocation scheme used 
by networking drivers, the driver is given permanent ownership 
of ranges and the individual pages are managed (similar 
to the second scheme above). When a networking driver 
attempts to map a page and the ranges it owns are all in use, 
the services use the default allocation scheme to gain own- 
ership of another range. Unlike the default scheme, these 
ranges are never returned to the free pool. When a driver 
unmaps the DMA buffer, the pages are marked as available 
resources only for that driver. 

When unmap() is called to unmap a DMA buffer, the appropriate 
allocation scheme is invoked to release the resource. If 
the buffer was allocated via the default allocation scheme, 
then unmap() purges the I/O TLB entry using the I/O adapter's 
direct write command to invalidate the entry. The range is 
then marked as available. If the sequential allocation 
scheme is used, then the I/O TLB is purged using the I/O 
adapter's purge TLB command each time unmap() is called. 

I/O Page Directory Entry Initialization. Once map() has allocated 
the appropriate I/O virtual address as described above, it 
will initialize the corresponding entry in the I/O page directory. 
Fig. 4 shows the format of an I/O page directory entry. 
To fill in the page directory entry, map() needs to know the 
physical address, the virtual index in the processor cache, 
whether the driver will allow the device to prefetch, and the 
page type attributes. The physical address and virtual index 
are both obtained from the range_type and host_range parameters 
by using the LPA and LCI instructions, respectively. The 
LCI (load coherence index) instruction was defined specifically 
for coherent I/O services to determine the virtual index 
of a memory object. The page type attributes are passed to 
map() via the hints parameter. Hints that the driver can specify 
are: 

• IO_SAFE. Causes the safe page attribute to be set. 

• IO_LOCK. Causes the lock (atomic) page attribute to be set. 

• IO_NO_SEQ. Causes the prefetch enable page attribute to be 
cleared. 

Refer to "Attribute Bits" on page 54 for details of how the 
I/O adapter behaves for each of these driver hints. Once the 
I/O page directory has been initialized, the buffer can be 
used for DMA. 

Other hints are: 

• IO_IGN_ALIGNMENT. Normally the safe bit will be set automatically 
by map() for buffers that are not cache line aligned or 
that are smaller than a cache line. This flag forces map() to 
ignore cache line alignment. 

• IO_CONTIGUOUS. This flag tells map() that the whole buffer 
must be mapped in a single call. If map() cannot map the 
buffer contiguously, then an error is returned (this hint implies 
IO_IGN_ALIGNMENT). 

Driver Impact. The finite size of the I/O page directory used 
for address translations posed some interesting challenges 
for drivers. 

Drivers must now release DMA resources upon DMA com- 
pletion. This requires more state information than drivers 
had previously kept. Additionally, some drivers had previously 
set up many (thousands of) memory objects, which 
I/O adapters need to access. Mapping each of these objects 
into the I/O page directory individually could easily consume 
thousands of entries. Finally, the default I/O page directory 
allocation policies would allocate several entries to a driver 
at a time, even if the driver only requires a single translation. 

In the case where drivers map hundreds or thousands of 
small objects, the solution requires the driver code to be 
modified to allocate and map large regions of memory and 
then break it down into smaller objects. For example, if a 
driver individually allocates and maps 128 32-byte objects, 
this would require at least 128 I/O page directory entries. 
However, if the driver allocates one page (4096 bytes) of 
data, maps the whole page, and then breaks it down into 
128 32-byte objects, only one I/O page directory entry is 
required. 
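
The arithmetic behind this example can be checked with a short sketch. The helper names are hypothetical, and the count is a lower bound: it charges one entry per 4K-byte page touched by each mapping and ignores the fact that the default allocator hands out entries several at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u

/* Minimum I/O page directory entries for one mapping: one per page touched. */
static size_t pdir_entries_for(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    assert(len > 0);
    first = addr / PAGE_BYTES;
    last  = (addr + len - 1) / PAGE_BYTES;
    return (size_t)(last - first + 1);
}

/* Entries used when each of count objects is mapped by its own call. */
static size_t pdir_entries_individual(const uintptr_t *addrs, size_t count,
                                      size_t obj_len)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < count; i++)
        total += pdir_entries_for(addrs[i], obj_len);
    return total;
}
```

For 128 objects of 32 bytes each, individual mapping costs at least 128 entries, whereas mapping the single 4096-byte page that contains them costs one.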

Another solution is to map only objects for transactions that 
are soon to be started. Some drivers have statically allocated 
and mapped many structures, consuming large numbers 
of I/O page directory entries, even though only a few 
DMA transactions were active at a time. Dynamically mapping 
and unmapping objects on the fly requires extra CPU 
bandwidth and driver state information, but can substantially 
reduce I/O page directory utilization. 

Networking-Specific Applications 

The benefits of the selected hardware I/O cache coherence 
scheme become evident when examining networking appli- 
cations. 

High-speed data communication links place increased demands 
on system resources, including CPU, bus, and memory, 
which must carry and process the data. Processing the 
data that these links carry and the applications for which 
they will be used requires that resource utilization, measured 
both on a per-byte and on a per-packet basis, be 
reduced. Additionally, the end-to-end latency (the time it 
takes a message sent by an application on one system to be 
received by an application on another system) must be 
reduced from hundreds of microseconds to tens of microseconds. 

Cache coherent I/O, scatter-gather I/O, and copy-on-write 
I/O all offer reduced resource consumption or reduced la- 
tency or both. They do this by reducing data movement and 
processor stalls and by simplifying both the hardware and 
software necessary to manage complex DMA operations. 

Cache coherent I/O reduces the processor, bus, and memory 
bandwidth required for each unit of data by eliminating the 
need for the processors to manage the cache and by reducing 
the number of times data must cross the memory bus. 
The processor cycles saved also help to reduce per-packet 
latency. 

The I/O adapter's address translation facility can be used to 
implement scatter-gather I/O for I/O devices that cannot 
efficiently manage physically noncontiguous buffers. Previously, 
drivers needed to allocate large, physically contiguous 
blocks of RAM for such devices. For outbound I/O, the 
driver would copy the outbound data into such a buffer. The 
mapping facility allows the driver to set up virtually contigu- 
ous mappings for physically scattered, page-sized buffers. 
The I/O device's view of the buffer is then contiguous. This 
is done by allocating the largest range that the I/O mapping 
services allow (32K bytes, currently), then using the remap() 
facility to set up a translation for each physical page in a 
DMA buffer. Using this facility reduces the processing and 
bus bandwidth necessary, and the associated latencies, for 
moving noncontiguous data. Requiring only a single address/ 
length pair, this facility can also be used to reduce the pro- 
cessing necessary to set up DMAs for, and the latencies im- 
posed by, existing scatter-gather I/O mechanisms that 
require an address/length pair for each physical page. 

Cache coherent I/O can be used to achieve true copy-on-write 
functionality. Previously, even for copy-on-write data, 
the data had to be flushed from the data cache to physical 
memory, where the I/O device could access the data. This 
flush is essentially a copy. Cache coherent I/O, by allowing 
the I/O device to access data directly from the CPU data 
caches, eliminates processing time and latency imposed by 
this extra copy. The hardware can now support taking data 
straight from a user's data buffer to the I/O device. 

To take advantage of the optimal page attributes where possible 
(e.g., IO_FAST for inbound DMA buffers) while ensuring 
correct behavior for devices that require suboptimal memory 
accesses such as I/O semaphore or locked (atomic) memory 
transactions, the mapping facility can be used to alias multiple 
I/O virtual addresses to the same physical addresses. 
Some software DMA programming models place DMA control 
information immediately adjacent to the DMA buffer. 
Frequently, this control information must be accessed by the 
I/O device using either read-modify-write or locking behavior. 
By mapping the same page twice, once as IO_SAFE and again 
as IO_FAST, and providing the I/O device with both I/O virtual 
addresses, the device can access memory using optimal 
IO_FAST DMA for bulk data transfer and IO_SAFE DMA for 
updating the control structures. 

Finally, through careful programming, it is possible to take 
advantage of IO_FAST coherent I/O, and to allow the driver to 
maintain coherence for the occasional noncoherent update. 
For example, it is possible for the driver to flush a data 
structure explicitly from its cache, which will later be partially 
updated through a write-purge transaction from the 
adapter. This has the advantage of allowing the adapter to 
use its optimum form of DMA, while allowing the driver to 
determine when coherency must be explicitly maintained. 

Performance Results 

To collect performance results, the SPEC† SFS†† (LADDIS) 
1.1 benchmark was run on the following configuration: 

• 4-way HP 9000 Model K410 

• 1G-byte RAM 

• 4 FW-SCSI interfaces 

• 56 file systems 

• 4 FDDI networks 

• HP-UX 10.01 operating system 

• 132 NFS daemons 

• 8 NFS clients 

• 15 load-generating processes per client. 

The noncoherent system achieved 4255 NFS operations per 
second with an average response time of 36.1 ms/operation. 
The coherent system achieved 4651 NFS operations per sec- 
ond with an average response time of 32.4 ms/operation. 

The noncoherent system was limited by the response time. 
It's likely that with some fine-tuning the noncoherent system 
could achieve slightly more throughput. 

To compare the machine behavior with and without coherent 
I/O, CPU and I/O adapter measurements were taken during 
several SFS runs in the configuration described above. The 
requested SFS load was 4000 NFS operations per second. 
This load level was chosen to load the system without 
hitting any known bottlenecks. 

Comparing the number of instructions executed per NFS 
operation, the coherent system showed a 4% increase over 
the noncoherent system, increasing to 40,100 instructions 
from 38,500. This increase was because of the overhead of 
the mapping calls. If we assume an average of 11 map/unmap 
pairs per NFS operation, then each pair costs about 145 
instructions more than the alternative broadcast flush/purge 
data cache calls. 

† SPEC stands for Systems Performance Evaluation Cooperative, an industry-standard 
benchmarking consortium. 

†† The SPEC SFS Benchmark measures a system's distributed file system (NFS) performance. 

The degradation in path length was offset by a 17% improvement 
in CPI (cycles per instruction). CPI was measured at 
2.01 on the coherent system and 2.42 on the noncoherent 
system. 

The overall result was a 13% improvement in CPU instruction 
issue cycles per NFS operation. The coherent system 
used 80,700 CPU cycles per operation, while the noncoherent 
system needed 93,300 cycles. 
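
These figures are related by simple arithmetic: cycles per operation is the product of instructions per operation and CPI. The sketch below only re-derives the quoted numbers; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles per operation = instructions per operation x cycles per instruction. */
static double cycles_per_op(double instr_per_op, double cpi)
{
    assert(instr_per_op > 0.0 && cpi > 0.0);
    return instr_per_op * cpi;
}

/* Fractional reduction in going from before to after. */
static double improvement(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}

/* Extra instructions attributed to each map/unmap pair. */
static double cost_per_pair(double extra_instr, double pairs_per_op)
{
    assert(pairs_per_op > 0.0);
    return extra_instr / pairs_per_op;
}
```

The products, 40,100 x 2.01 and 38,500 x 2.42, come to about 80,600 and 93,170 cycles, slightly below the reported 80,700 and 93,300, presumably because the published inputs are rounded. The reported cycle counts give an improvement of about 13.5%, the CPI values an improvement of about 17%, and the 1,600 extra instructions spread over 11 map/unmap pairs come to about 145 per pair.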

To determine the efficiency of the software algorithms that 
manage the I/O TLB and to evaluate the sizing of the TLB, 
the number of I/O TLB misses was measured during these 
SFS runs. Under an SFS load of 4000 NFS operations per 
second, the disk drives missed 1.30 times per NFS operation, 
or 0.64% of all accesses. 
A 1.0625-Gbit/s Fibre Channel Chipset 
with Laser Driver 



This chipset implements the Fibre Channel FC-0 physical layer 
specification at 1.0625 Gbits/s. The transmitter features 20:1 data 
multiplexing with a comma character generator and a clock synthesis 
phase-locked loop, and includes a laser driver and a fault monitor for 
safety. The receiver provides the functions of clock recovery, 1:20 data 
demultiplexing, comma character detection, and word alignment, and 
includes redundant loss-of-signal alarms for eye safety. A single-chip 
version with both transmitter and receiver integrated is designed for disk 
drive applications using the Fibre Channel arbitrated loop protocol. 

by Justin S. Chang, Richard Dugan, Benny W.H. Lai, and Margaret M. Nakamoto 



The information revolution has pushed the datacomm world 
to gigabit rates. Hewlett-Packard's G-Link chipset¹ (HDMP-1000) 
helped pave the way for low-cost gigabit technology. 
Since that debut, important standards such as Fibre Channel 
(FC) have incorporated gigabit rates in their documents. HP 
now offers a low-cost solution for Fibre Channel applica- 
tions with the HDMP-1512 and HDMP-1514 transmitter and 
receiver chips, respectively. 

The chipset implements the physical layer interface as defined 
in Fibre Channel specification FC-0.² Both the transmitter 
and the receiver use a "bang-bang" phase-locked loop 
technique similar to the G-Link chipset. Since the standard 
allows the use of either copper or fibre media, the transmit- 
ter has an integrated CD (compact disk) laser driver in addi- 
tion to two 50-ohm cable drivers. Out of concern for eye 
safety, the standard requires the chipset to include certain 
monitors and controls to interface to an open fibre control 
(OFC) chip, so the chipset includes a laser fault monitor and 
loss-of-signal alarms. 

The chipset's speed is selectable: either 1062.5 Mbits/s or 
531.25 Mbits/s. To conserve power, the receiver chip imple- 
ments a demultiplexing scheme that allows the use of lower- 
speed and lower-power cells to recover the parallel data. A 
special selectable "ping-pong" mode for the parallel TTL bus 
helps reduce switching noise on the supply lines. The upper 
and lower 10 bits are shifted by half a clock cycle relative to
each other when ping-pong mode is active. 

Both chips are implemented using a proprietary HP device
array based on the HP25 bipolar process, a 25-GHz f_T process.
The array concept not only allowed quick design cycle
times but also enabled the fabrication of a single-chip transceiver
that integrates both the transmitter and the receiver.
The 10-bit transceiver design heavily leveraged the cells of
the chipset. The transceiver is designed for disk drive appli- 
cations using the Fibre Channel arbitrated loop (FC-AL) 
protocol. 



System Configuration

The chipset is designed for use in a Fibre Channel optical 
module. Fig. 1a shows a typical system configuration. There
are three ICs: the transmitter, the receiver, and the OFC
(open fibre control) chip. The transmitter and receiver chips
use a system frame clock (53.125 MHz) for transmission and
to assist in meeting the data lock time. The transmitter uses 
an external p-n-p power device to handle the potentially large 
laser currents — as large as 130 mA. The OFC controller 
monitors link status lines from the transmitter and receiver 
chips to handle the safety protocol and the link startup 
sequence as described in the standard. A photograph of the 
assembled module is shown in Fig. 1b.

Transmitter Chip 

The transmitter block diagram is shown in Fig. 2. It consists 
of three major blocks — the laser driver, the multiplexer, and 
the clock generator and phase-locked loop — plus a host of 
I/O and other supporting circuitry. 

Laser Driver 

The three distinct laser driver sections are the dc bias cir- 
cuit, the ac driver, and the safety circuit, as shown in Fig. 3. 
The laser driver is designed for anode bias configurations 
and operating rates up to 1.0625 Gbits/s. The dc bias circuit
can handle optical devices that require up to 130 mA of bias
current and have V_th as large as 2.3V. For 780-nm CD lasers
operating at -3-dBm optical power out, the typical dc bias
current is 40 mA and the monitor diode current is 400 µA.
The ac driver provides a minimum modulation of 25 mA
peak-to-peak into the laser. 

The dc bias circuit and the ac driver are decoupled for ease 
of adjustment. The decoupled scheme allows adjustment of 
either the average power or the modulation depth without 
affecting the other. Both of these settings are determined by 
resistive elements. The safety circuit monitors fault conditions
to ensure that the laser optical output power will not
be at unsafe levels. 

DC Bias Circuit. Referring to Fig. 3, the dc bias of the laser is
controlled by the operational amplifier feedback loop. The op
amp's positive input is internally set to 1.85V. It is obtained
through a voltage divider from the 2.42V bandgap reference
node (output of the bandgap reference circuit³), LZBGTP. The
negative input, LZMDF, is connected to a bias network controlled
by the laser monitor diode. More current to the laser
creates a larger monitor diode current, lowering the voltage
on LZMDF. This results in a higher output voltage on LZDC.
This decreases the V_be of transistor P1, thereby lowering the
laser bias current until the monitor diode node LZMDF and
the internal reference node are again equal. The monitor
diode has a slow optical response (rise and fall times =
10 ns); thus, it acts as a low-pass filter to improve stability.

The gain through the op amp and the p-n-p transistor affects 
the accuracy of the loop in holding to the original setting. 
The op amp has a typical gain of 20 dB. Depending on the 
external components used, the total voltage loop gain is 
nominally 40 dB. This is adequate to hold the bias. Current 
gain is supplied by the external p-n-p transistor. 

AC Driver Circuit. The ac driver uses a differential collector- 
driven output configuration. The nominal output impedance 
is 50 ohms. The drive current is controlled by the external
potentiometer, Pot 2. A temperature and supply compensated
constant-current bandgap reference is used to bias the current
source. The external resistor should have a low temperature
coefficient to minimize its effect on the ac drive as the
temperature changes. The supply to the final output stage is 
made available to the user for filtering out supply noise. 

The equivalent ac impedance for the laser diode is on the
order of 10 ohms. To assist the ac driver in driving current to 
the laser and not to the supply path, an 8-ohm resistor and 
an RF filter are used to increase the impedance looking into 
the dc bias network. 

Fig. 4 shows the optical eye pattern from a 780-nm CD laser. 
The typical 20-to-80% rise or fall time into a 25-ohm load is 
250 ps. The driver can also be adjusted to operate with 
1300-nm single-mode lasers.

Laser Monitor and Safety Control. The built-in safety circuit
uses the monitor diode current to check for high optical
output power. The circuit monitors the laser output for deviations
larger than ±10% from the nominal power setting. If
the optical power is out of this window, the monitor starts
the laser turn-off process. If the fault state continues for a
time set by the error timing capacitor, LZTC, the laser will be
turned off. The circuit can then only be reset by cycling the
laser-on pin, LZON.

The laser safety circuit can be activated in different ways. In 
all cases, the laser is turned off by pulling a large current at
the LZMDF pin, causing the window detector to sense a fault.
This causes the LZDC output to go high and turn off transistor
P1. At the same time, the ac driver is held in a static state.
This is necessary because the ac circuit has enough output
drive to pulse the laser without the dc bias current.

The window detector monitors the voltage on LZMDF. The
high and low levels are set at 1.85V ±10%. This translates
directly to monitoring whether the optical output power has
deviated more than ±10% from the nominal setting. If LZMDF
goes out of this range, the charge on the LZTC capacitor will
be discharged by a few hundred microamperes of current. If
the fault continues and the voltage lowers to the fault value
(1.3V), then the error detector cell will output a TTL high
level on the LZF pin and turn off the dc bias. The error time is
set by the capacitance between LZTC and ground. This will be
a few milliseconds for a 0.1-µF capacitor.
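
The window check and the error timing can be illustrated with a short Python sketch. The 1.85V reference, the ±10% window, the 1.3V fault level, and the 0.1-µF capacitor come from the text. The 300-µA discharge current and the 4.3V starting voltage on LZTC are assumed values chosen only for illustration, and the function names are hypothetical.

```python
# Sketch of the laser fault window detector and error timer.
V_REF = 1.85    # nominal LZMDF voltage, volts (from the text)
WINDOW = 0.10   # +/-10% window (from the text)
V_FAULT = 1.3   # LZTC voltage at which the fault is latched, volts (from the text)


def in_window(v_lzmdf: float) -> bool:
    """Return True when LZMDF is within +/-10% of the 1.85V reference."""
    return V_REF * (1 - WINDOW) <= v_lzmdf <= V_REF * (1 + WINDOW)


def fault_delay_s(c_farads: float, v_start: float, i_discharge: float) -> float:
    """Time for a constant current to pull LZTC from v_start down to V_FAULT."""
    return c_farads * (v_start - V_FAULT) / i_discharge


if __name__ == "__main__":
    # Assumed values: LZTC starts at 4.3V and is discharged at 300 uA.
    delay = fault_delay_s(0.1e-6, 4.3, 300e-6)
    print("1.85V inside window:", in_window(1.85))
    print("2.10V inside window:", in_window(2.10))
    print(f"fault delay: {delay * 1e3:.2f} ms")
```

With these assumed values the delay comes out at about a millisecond, the same order of magnitude as the figure quoted above. A smaller discharge current or a larger capacitor lengthens it proportionally.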

There is also a bandgap monitor cell which checks for gross 
faults with the 2.42V bandgap. Because this bandgap is used 
in setting the window range, a change will not necessarily
cause the window detector to sense a fault. The bandgap 
monitor uses the V_be voltage as a reference to sense when
the bandgap is higher than 3.0V or lower than 2.0V. The 3.0V 
translates into a maximum 2x increase in optical output 
power before the laser is turned off. 

Once a fault has been detected, this condition is latched
until the laser driver is reset using the LZON pin. A TTL high
on LZON will charge the LZTC capacitor while holding the
laser output off. When LZON is set low again, all laser circuitry
is enabled.

Fig. 4. 780-nm CD laser eye pattern (-3 dBm).

Transmitter Multiplexer 

The block diagram of the transmitter multiplexer is shown 
in Fig. 5. In the normal 1062.5-Mbit/s mode, shift registers
are loaded with the 20-bit-wide parallel data and then shifted
to form the high-speed output. To conserve power, an interlacing
method is used to allow the shift registers to operate
at half speed. These registers are separated into two banks
of ten and are loaded with the proper bit order. The outputs
of the two banks are then combined with one high-speed D
flip-flop. In the 531.25-Mbit/s mode, only 10 bits are loaded
into a single bank and the second bank is ignored.
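
The interlacing idea can be sketched in Python as follows. This is only a behavioral illustration, not the chip's logic: the assignment of even-numbered bits to one bank and odd-numbered bits to the other is an assumption (the text says only that the banks are "loaded with the proper bit order"), and the function name is hypothetical.

```python
from typing import List


def interlaced_mux(word: List[int]) -> List[int]:
    """Serialize a 20-bit word using two half-rate banks of ten bits each."""
    if len(word) != 20:
        raise ValueError("expected a 20-bit word")
    bank_a = word[0::2]  # assumed: bits 0, 2, 4, ... go to the first bank
    bank_b = word[1::2]  # assumed: bits 1, 3, 5, ... go to the second bank
    serial: List[int] = []
    for a, b in zip(bank_a, bank_b):
        # Each bank shifts once per pair of output bits (half speed);
        # the final high-speed stage alternates between the two banks.
        serial.extend([a, b])
    return serial


if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]
    print(interlaced_mux(word) == word)
```

Each bank shifts only ten times per 20-bit word, which is why the shift registers can run at half the serial rate while the output bit order is unchanged.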

The data byte 1 inputs are inserted with a set of latches,
allowing this data to be shifted by one-half bit relative to
data byte 0. This configuration allows for the ping-pong
mode of operation, in which the two input bytes are time-shifted
by one-half bit to minimize possible switching noise.

Fig. 5. Transmitter multiplexer block.

An extra feature of the transmitter is "comma" character 
generation. When this mode is asserted, the K28.5 character 
(0011111010) is loaded into the shift registers. This is particularly
helpful in the evaluation phase of the chipset for byte-aligning
the receiver without the higher-order FC-1 and FC-2
chips. 

Transmitter Clock Generator 

The logic block diagram of the transmitter clock generator 
is shown in Fig. 6. It takes the input from the VCO at 1062.5
MHz and derives the necessary clocks for the multiplexer.
The clock rate reduction involves serial divisions of 2, 2, and
5 for the 1062.5-Mbit/s (20-bit) mode and 2, 5, and 2 for the
531.25-Mbit/s (10-bit) mode. The divide-by-5 function is last
for the 1062.5-Mbit/s mode, allowing it to operate at the
slower speed to reduce power. All of the clocks are retimed
by the high-speed clock to ensure proper clock alignment.
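
The divider arithmetic can be checked with a few lines of Python. This sketch assumes the third divisor in the 20-bit mode is 5, which is what is needed to reach the 53.125-MHz frame clock from the 1062.5-MHz VCO. The dictionary keys and function name are hypothetical.

```python
VCO_MHZ = 1062.5

# Serial division stages for each mode.
MODE_DIVISORS = {
    "20-bit": (2, 2, 5),  # 1062.5-Mbit/s mode, divide-by-5 last
    "10-bit": (2, 5, 2),  # 531.25-Mbit/s mode
}


def clock_chain(vco_mhz: float, divisors) -> list:
    """Return the clock frequency (MHz) after each serial division stage."""
    freqs = []
    f = vco_mhz
    for d in divisors:
        f /= d
        freqs.append(f)
    return freqs


if __name__ == "__main__":
    for mode, divisors in MODE_DIVISORS.items():
        print(mode, clock_chain(VCO_MHZ, divisors))
```

Both chains end at the 53.125-MHz frame clock. In the 20-bit mode the divide-by-5 stage is fed by 265.625 MHz, while in the 10-bit mode it is fed by 531.25 MHz, which is the slower-speed, lower-power arrangement described above.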

Transmitter Phase-Locked Loop 

The phase-locked loop is a bang-bang type and is able to
lock onto the reference clock at 53.125 MHz. It consists of a
modified sequential detector, a charge pump integrator, a
VCO, and the clock generator. The detector, integrator, and
VCO were leveraged from the G-Link chipset.¹ The nominal
bang-bang time of the VCO is 1 ps per cycle.



Receiver Chip 

The block diagram of the receiver chip is shown in Fig. 7. 
It consists of the demultiplexer, the phase-locked loop and 
clock generator, redundant loss-of-signal (LOS) detectors, 
and other I/O and supporting circuits such as the power-on 
supervisor. 

Receiver Demultiplexer 

The logic diagram of the demultiplexer is shown in Fig. 8. In 
addition to providing the serial-to-parallel conversion of the 
input bit stream, it also detects the comma character
(0011111xxx) for proper frame alignment, as required by the
Fibre Channel standard. Once the comma character is de- 
tected, a reset signal is sent to the clock generator for syn- 
chronization. 
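
As a software analogy for the alignment step, the following Python sketch searches a serial bit string for the comma pattern (the seven leading bits 0011111) and slices the stream into 10-bit characters from that point. It is a behavioral illustration only. The function names are hypothetical, and the sketch handles only the single comma pattern quoted in the text.

```python
from typing import List

COMMA = "0011111"  # leading bits of the comma character, 0011111xxx


def find_comma(bits: str) -> int:
    """Return the index of the first comma pattern, or -1 if none is present."""
    return bits.find(COMMA)


def align_words(bits: str, width: int = 10) -> List[str]:
    """Slice the stream into `width`-bit characters starting at the comma."""
    start = find_comma(bits)
    if start < 0:
        return []
    return [bits[i:i + width] for i in range(start, len(bits) - width + 1, width)]


if __name__ == "__main__":
    # Three stray bits, then K28.5, then one more character, then a partial one.
    stream = "101" + "0011111010" + "1010101010" + "01"
    print(find_comma(stream))
    print(align_words(stream))
```

In the chip, the detection of this pattern produces the reset signal that synchronizes the clock generator. The string slicing here plays the role of that resynchronization.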

To minimize power consumption, an interlacing method of 
demultiplexing is used in the receiver chip. The high-speed 
data stream is first deciphered into two streams at half the 
rate, and these are loaded into two banks of shift registers. 
The parallel data in the shift registers is then clocked into 
the output flip-flops at the frame rate. An extra bank of 
latches is added for data byte 0, which enables the ping-pong
mode of operation.

Since there are two possible ways in which the deciphering 
can occur, that is, bit one could be in either bank one or 
bank two, proper decoding is needed to reassemble the final
byte pattern. This is accomplished by the selector block
preceding the output flip-flops, and is controlled by how the
comma character is detected within the two banks. When 
either case is detected, the reset indicator to the clock gen- 
erator is set high. Since this signal is a critical path in the 
overall operation of the chip, it is retimed to give the clock 
generator more margin to reset. As a result, the data is 
delayed by an additional cycle before being loaded out. This 
delay is compensated by extending the shift register count 
by one. 
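
The two alignment cases can be illustrated with a small Python model. It splits a serial stream into two half-rate banks by bit position, then tries both ways of interleaving them until the comma pattern appears at the start of a word. This is a sketch under stated assumptions: the search loop stands in for the hardware comma detectors, the extra retiming delay is not modeled, and all names are hypothetical.

```python
from itertools import chain
from typing import Optional, Tuple

COMMA = "0011111"  # leading bits of the comma character


def interleave(lead: str, follow: str) -> str:
    """Merge two half-rate streams, taking the first bit from `lead`."""
    return "".join(chain.from_iterable(zip(lead, follow)))


def recover_word(stream: str, width: int = 20) -> Optional[Tuple[str, str]]:
    """Return (bank holding bit one, aligned word), or None if no comma is found."""
    bank_one, bank_two = stream[0::2], stream[1::2]
    half = width // 2
    for i in range(len(bank_one)):
        # Case 1: bit one of the word landed in bank one.
        word = interleave(bank_one[i:i + half], bank_two[i:i + half])
        if len(word) == width and word.startswith(COMMA):
            return "bank one", word
        # Case 2: bit one landed in bank two, so bank one supplies the
        # following bit and is read one position later.
        word = interleave(bank_two[i:i + half], bank_one[i + 1:i + 1 + half])
        if len(word) == width and word.startswith(COMMA):
            return "bank two", word
    return None


if __name__ == "__main__":
    word = "0011111010" + "0101010101"
    print(recover_word("11" + word + "00"))   # word starts on an even position
    print(recover_word("1" + word + "000"))   # word starts on an odd position
```

The selector in the chip makes the same choice: which bank supplies the first bit depends on where the comma character was seen.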

The data is further delayed before unloading by the anti- 
sliver feature (discussed next) in the clock generator where 
the clocks are extended. The number of registers is in- 
creased in the same manner to compensate. 

Receiver Clock Generator and Antisliver Circuit

The logic diagram of the clock generator is shown in Fig. 9.
In a manner similar to its transmitter counterpart, it takes
the VCO output and generates all of the necessary clocks
required by the chip. It includes an antisliver circuit, which
ensures that the frame clock presented to the user has no
"slivers," as explained below.

The core of the generator is a divide-by-5-or-10 counter. To 
minimize power, the frame clock goes through a 2,10 scaling
for the 1062.5-Mbit/s mode, and a 2,2,5 scaling for the
531.25-Mbit/s mode. However, RBCLK requires a clock at the
frame rate that must have a 50% duty cycle. Since this is not
possible with the natural states within the divide-by-five
counter (2/5 or 3/5), the pulse of the 2/5 count is extended
by one cycle of the high-speed clock, yielding the required 
50% duty cycle. 
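
The duty-cycle arithmetic can be worked through in Python. The sketch assumes that the divide-by-five counter advances once for every two cycles of the clock used for the extension, so that one extra cycle adds half a counter state. That ratio is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
from fractions import Fraction


def duty_cycle(high_states: int,
               total_states: int = 5,
               fast_cycles_per_state: int = 2,
               extend_fast_cycles: int = 0) -> Fraction:
    """Duty cycle of a pulse spanning `high_states` counter states,
    optionally stretched by a number of fast-clock cycles."""
    high = high_states * fast_cycles_per_state + extend_fast_cycles
    period = total_states * fast_cycles_per_state
    return Fraction(high, period)


if __name__ == "__main__":
    print(duty_cycle(2))                        # natural 2/5 pulse
    print(duty_cycle(3))                        # natural 3/5 pulse
    print(duty_cycle(2, extend_fast_cycles=1))  # 2/5 pulse stretched by one cycle
```

Under this assumption the 2/5 pulse occupies 4 of 10 fast-clock cycles, and one additional cycle brings it to 5 of 10.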

When the reset signal is applied, the counter is forced to a 
predetermined state. Because the previous state of the 
counter is random, the frame clock may contain short pulses 
or slivers, which could cause problems for the user. The 
antisliver circuit continuously monitors the counter for this 
condition and masks these bursts as they occur. The logic 
accommodates both the 531.25-Mbit/s mode and the
1062.5-Mbit/s mode. 

Receiver Phase-Locked Loop 

The phase-locked loop of the receiver uses the same basic 
configuration as the transmitter phase-locked loop, with the 
addition of a phase detector for NRZ data. The design of this
detector is identical to the detector used in another proprietary
HP IC,⁴ with the exception that the falling edges are
ignored. This is to eliminate any effect of the excess jitter of
the falling edge, which arises from the self-pulsing CD
lasers. The lock-to-reference (-LCKREF) input enables the 
user to activate the frequency detector for initial frequency 
acquisition. 

Receiver Loss-of-Signal (LOS) Detector 
With major concerns for eye safety, the Fibre Channel standard
calls for redundant LOS detectors within the optical
module to ensure a robust system. Two LOS detectors are
incorporated in the receiver IC, and are provided as outputs
to the OFC chip. Since the alarm outputs are heavily filtered 
within the OFC chip, hysteresis is not necessary in the LOS 
design. 

The LOS detectors are based on a concept of envelope 
detection without the use of a capacitor. One detector is 
configured to detect the loss of amplitude resulting from a
lack of received signal. The second detects the condition in
which the differential inputs are static, indicating a fault in
the optical receiver. Both LOS detectors are further digitally
filtered, with one driven by the reference clock and the
other by an internal clock. This further ensures the reliability
of the fault detection system for maximum safety. The
triggered threshold is preset to 25 mV and can be adjusted
with an external resistor, as shown in Fig. 10. 

The receiver front-end sensitivity is well below the nominal
LOS threshold of 25 mV. Fig. 11 shows the bit error rate
(BER) as a function of the input differential signal. The BER
is basically zero for signals 6 mV and above. Because it is
impractical to perform actual tests for BER as low as 10^-20
(~3000 years), one can use the plot to extrapolate the BER
for the incoming signal amplitude. Tests have been run for
weeks without a single error.
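
The time scale behind this remark follows from simple arithmetic, shown here as a Python sketch. At 1.0625 Gbits/s, a BER of 1e-20 corresponds to one error in 1e20 bits on average. The two-week test length in the usage lines is a hypothetical figure, since the text says only "weeks", and the function names are also hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year
SECONDS_PER_WEEK = 7 * 24 * 3600


def years_per_error(bit_rate_bps: float, ber: float) -> float:
    """Average time between errors, in years, at a given bit rate and BER."""
    return 1.0 / (bit_rate_bps * ber) / SECONDS_PER_YEAR


def bits_transferred(bit_rate_bps: float, weeks: float) -> float:
    """Number of bits sent during a test of the given length."""
    return bit_rate_bps * weeks * SECONDS_PER_WEEK


if __name__ == "__main__":
    print(f"{years_per_error(1.0625e9, 1e-20):.0f} years per error")
    # Hypothetical two-week test run:
    print(f"{bits_transferred(1.0625e9, 2):.3e} bits in two weeks")
```

The first figure comes out just under 3000 years. A two-week run covers only on the order of 1e15 bits, which is why the lowest error rates have to be extrapolated from the plot and cannot be measured directly.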

Transceiver Chip 

One major target application for Fibre Channel is disk ar- 
rays. This application demands a much lower-power and 
lower-cost solution than the chipset offers. The new HP
HDMP-1526 transceiver is a 10-bit, 1062.5-Mbaud transceiver
designed for this Fibre Channel arbitrated loop (FC-AL)
market. It is a descendant of the HDMP-1512/HDMP-1514
20-bit, 1062.5/531.25-Mbaud transmitter/receiver chipset.
The Fibre Channel chipset has many functions not needed in
the FC-AL transceiver chip, such as optical interface blocks.
Fig. 12 shows the block diagram of the transceiver. The reduction
in functions and the change to a 10-bit bus allowed
the integration of both transmitter and receiver functions
onto a single die, using a proprietary HP device array. The
transceiver uses a 10-bit parallel interface running at
106.25 Mbaud instead of a 20-bit parallel interface running
at 53.125 Mbaud.

Deletions on the transmitter side include the ac laser driver 
and its supporting dc bias circuitry and the comma genera- 
tor function. Deletions on the receiver side include the pow- 
er-on/reset circuitry, the loss-of-signal circuitry, and the 
cable equalizer function. Deleted functions common to both 
the transmitter and receiver include the speed selector and
the ping-pong selector. The loopback ports were also deleted
because an internal connection is possible with a
single-chip design. This leaves just one high-speed output
port and one high-speed input port. Other timing changes
were implemented to fit specific customer needs.

Fig. 11. BER as a function of data input amplitude.

In the single-chip design, select inputs and clocks are shared 
between the transmitter and receiver. This lowers the power 
requirements, reduces the number of pins, and makes pos- 
sible a smaller chip size if a custom layout is done at a later 
date. The transceiver, with its 1.8-watt total power dissipation
(compared to 3 watts for the chipset), is packaged in a
single 64-pin 14-by-14-mm quad flat pack.

Isolating the two independent phase-locked loops within a
3.54-mm-by-3.54-mm area presented the biggest challenge
during the layout of the chip. The transmitter and receiver
phase-locked loops are placed at opposite corners to minimize
cross talk. Various portions of the chip are isolated by
using separate power supplies and bandgap references.

Fig. 13. Transceiver die (proprietary HP array).

Although much of the chip and block-level layout was redone
starting from the 20-bit chipset, the proprietary HP
array enabled us to complete the transceiver design very 
quickly. Fig. 13 shows a photograph of the FC-AL 10-bit 
transceiver die implemented on the array. 



Summary 

A two-chip gigabit chipset conforming to the FC-0 specifica- 
tion has been fabricated. The speed of the chipset is user- 
selectable at either 1062.5 Mbaud or 531.25 Mbaud. The 
transmitter integrates a high-speed laser driver capable of
driving either 780-nm CD lasers or 1300-nm lasers. The receiver
has redundant loss-of-signal detectors for eye safety.
The chipset runs on a single +5Vdc supply and has TTL data
and control interfaces. Implementation using a proprietary
HP device array allowed a quick design cycle to produce a
10-bit single-chip transceiver for the FC-AL market.

Applying the Code Inspection Process
to Hardware Descriptions 



The code inspection process from the software world has been applied to 
Verilog HDL (hardware description language) code. This paper explains the 
code inspection process and the roles and responsibilities of the 
participants. It explores the special challenges of inspecting HDL, the 
types of findings made, and the lessons learned from using the process 
for a year. 



by Joseph J. Gilray



The primary goal of the code inspection process is to maximize
the quality of the code produced by an organization. A
secondary benefit of the process is that it allows members
of the development teams to share best practices. The code
inspection process revolves around a formal inspection meeting.
The process calls for the development of operational
definitions, planning, a technical overview, preparation for
the meeting, rework after the meeting, and follow-up. Fig. 1
illustrates the relationships between the steps. The steps
themselves are described in the sections that follow. As
shown in Fig. 1, the operational definitions affect all stages
in the inspection process. Between some of the stages in
the inspection process, decisions whether or not to continue
need to be made. These decisions are indicated on the figure
by "Proceed?".

The code inspection process as implemented at the HP Integrated
Circuits Business Division in Corvallis, Oregon (ICBD
Corvallis) contains several roles: process manager, moderator,
author, paraphraser (reader), scribe, and inspector.
There is only one permanent role, that of inspection process
manager. The remaining roles are filled for each inspection.
The subsections below call out the general responsibilities
and duties of each role. Specific tasks are called out in the
description of each process stage later in the paper. HP's
software quality engineering department has published
checklists for each role that can be very useful when getting
started with the process.

Process Manager. Ensures that best practices are spread
among the designers of the organization or project. Tasks
include developing and publishing operational definitions
(described in the next section), disseminating best practices
and common defects, and acting as an advocate for the HDL
code inspection process. This last item cannot be overemphasized.
The process manager must ensure that priority
is given to inspections even in the face of mounting schedule
pressure on the design team. It is also important that the
process manager make clear that the specific results (defects
found) of the inspections will not be made available to
management or any other party. The inspection process can
only succeed in an environment where the members of the
design team feel secure in opening their code to review. So
that management sees the value of the process, the process
manager should keep general results of the overall inspection
process such as the number and type of defects found,
the time spent, lines of code inspected, and most important,
best practices shared. These process statistics are very useful,
but the grass roots support that develops for the inspection
process will be the real indicator of its value.

Moderator. Manages each step of the process for a given inspection.
Ensures that participants are prepared and that
requirements are met. 



Fig. 1. The code inspection process.

Author. Prepares the HDL for inspection. Creates supplementary
documentation (such as block diagrams) as necessary
to explain the purpose of the code. Open to suggestions and 
defects. Reworks the HDL as necessary. 

Paraphraser. Familiar with guidelines and best practices. 
Able to explain the HDL code during the inspection meeting.

Scribe. Logs defects and enhancements found during the 
meeting. 

Inspector. Reads and understands the HDL. Notes any defects,
comments, or enhancements before the meeting.
Every person involved in the meeting participates as an 
inspector. 

Development of Operational Definitions 

An operational definition is simply a standard. Before an 
inspection takes place a core set of operational definitions 
should be in place and recognized by the design team. They 
are developed from conventions, guidelines, industry stan- 
dards, and recognized best practices. For HDL code inspec- 
tions at ICBD Corvallis, we adopted the simplest set of oper- 
ational definitions that we felt were adequate to guide the 
process: 

• Coding style standards. Although no explicit HDL coding 
standard was selected, we developed a standard HDL module
header (Fig. 2).

• Definition of a defect. We defined a defect as any deviation 
from the module specification as presented in the technical 
overview meeting (see below) and the HDL module header. 

• Defect severity codes. We applied a simple defect severity
scale based on HP's internal Defect Tracking System (DTS), 
as shown in Table I. 

Table I
Defect Severities

Name      ID  Description

Critical  9   Defect will lead to unworkable or grossly
              inefficient design.

Serious   7   Defect will lead to a large deviation from
              the specification or to a design that is
              unreliable or very inefficient.

Major     5   Defect will lead to a deviation from
              the specification or to a design that is
              inefficient.

Minor     3   Defect will lead to a minor deviation from
              the specification or to a design that is
              slightly inefficient. Also used when code
              is in serious need of comments to be
              maintainable.

Wibni     1   "Wouldn't it be nice if...?" This ID is used
              for enhancement requests, which are typically
              changes in coding style or requests
              for clarifying comments in the code.

• Defect logging standards. We started out using inspection
data summary sheets provided by HP's software quality
engineering department, but after a few inspections we
found that an open-format inspection process and defect
logs worked better.



// Filename
// Module name(s)
// Author name(s)
// Revision log
//
// File description (why are these modules grouped together?)
// ...
//
// Module name (for each module)
// Module description
// Signal descriptions (these include all HDL signals including wires)
// For each signal specify:
//   - type
//   - purpose
//   - values/states description
//   - invariants (such as tristate nodes that are always driven)
//   - special loading conditions
//   - value at reset
//   - overflow/wraparound conditions (e.g., for counters)



Fig. 2. Standard Verilog HDL module header adopted for code 
inspections. 

• Target-based best practices. ICBD Corvallis developed a set
of Verilog HDL coding guidelines to ensure reliable, high-quality
synthesis results. These guidelines include sections
on clocking strategies, block structure, latches and registers,
state machines, design for test, ensuring consistent behavioral
and structural simulation results, and issues specific to
Synopsys synthesis tools, which are used extensively by HP.
This document provided valuable input to the HDL code
inspection process and itself benefitted from the practices
shared during the inspections.

• Inspection entry criteria. The inspection entry criteria were 
that the HDL had to be functionally correct in behavioral 
simulation and had to be of small-to-moderate size (100 to 
700 noncomment Verilog HDL source statements). 

• Inspection exit criterion. We did not develop a formal inspection
exit criterion. Instead, the moderator was given the
responsibility of ensuring that rework was satisfactorily 
completed for each piece of HDL inspected. 
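
The size part of the entry criteria lends itself to a quick automated check. The Python sketch below approximates the number of noncomment Verilog source statements by counting semicolons after comments are removed. That counting rule is an assumption made for illustration and not the definition used at ICBD Corvallis. The function names are hypothetical, and the sketch does not handle comment markers inside string literals.

```python
import re


def strip_comments(verilog: str) -> str:
    """Remove /* ... */ block comments and // line comments."""
    no_block = re.sub(r"/\*.*?\*/", "", verilog, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", no_block)


def count_statements(verilog: str) -> int:
    """Approximate noncomment source statements by counting semicolons."""
    return strip_comments(verilog).count(";")


def meets_size_criterion(verilog: str, low: int = 100, high: int = 700) -> bool:
    """True when the statement count is within the 100-to-700 entry range."""
    return low <= count_statements(verilog) <= high


if __name__ == "__main__":
    sample = """
    // counter; comment text is ignored
    module counter(clk, q);   /* ports; */
      input clk;
      output [3:0] q;
      reg [3:0] q;
      always @(posedge clk) q <= q + 1;
    endmodule
    """
    print(count_statements(sample), meets_size_criterion(sample))
```

The other entry criterion, functional correctness in behavioral simulation, still has to be established by running the simulation itself.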

Planning 

When a designer feels that a piece of HDL code is a good 
candidate for inspection, the designer asks another designer 
to act as moderator. Together they review the HDL to be 
inspected to ensure that it meets the entry criteria, especially
that the amount of HDL to be inspected is appropriate.
In addition, they review any supplementary documentation
such as module specifications or block diagrams and discuss
what will need to be presented at the technical overview
meeting. The moderator, with help from the author,
assembles the rest of the inspection team: a paraphraser
(reader), a scribe, and up to three additional inspectors. It is
the moderator's responsibility to schedule the technical
overview meeting and to ensure that the inspection team
members are prepared to meet their responsibilities. The 
moderator should treat the meetings and preparation as a 
very important requirement for each participant. Every per- 
son involved must be prepared — at a code inspection, no 
one is just an observer. 

Technical Overview 

The technical overview meeting should last no more than 
fO minutes. Its primary purpose is to allow the author to 
outline the module(s) to be inspected and to answer questions.
The roles are formally assigned during this meeting
and the moderator should ensure that all participants understand
the roles assigned to them. If there are inexperienced
inspection team members, the moderator should take time
to explain the operational definitions and to pass out responsibility
checklists for each role. Finally, any supplementary
documentation and the HDL code itself are distributed
to the team. The code should be printed with line numbers
so that during the inspection meeting all team members can
more easily follow the discussion. 

Preparation 

Each member of the inspection team should spend from two 
to four hours reading over the HDL. Team members should
mark possible defects on their copies of the code. Team
members should freely discuss the code among themselves
but not in a wider context, to protect the privacy of the
author. The team should be given at least a week to look
over the HDL. During this time the moderator should schedule
the inspection meeting. Before the meeting the moderator
should ensure that all team members are prepared and
can participate in the meeting before allowing the inspec- 
tion to proceed. 

Inspection Meeting 

The inspection meeting is the heart of the process. The mod- 
erator must reserve a quiet room for a sufficient amount of 
time. Typically inspection meetings take from two to three
hours. The moderator is also responsible for keeping the
meeting on track so the code can be completely inspected in
the time allowed. To start the meeting the scribe should record
the amount of preparation time required of each participant.
The paraphraser should announce the order in which
the code will be inspected. Typically this is top-down or
bottom-up. The paraphraser explains each block of code and
allows time for each inspector to discuss possible defects or
enhancements to that code.

The goals of the meeting may vary somewhat from organiza- 
tion to organization, but typically the major goals are to find 
defects in the code under inspection and to share best practices
among the members of the design or coding team. In
our process, we encouraged discussion of any defect or
enhancement. Although this does not strictly adhere to the
traditional software inspection process, we felt the benefits
(improved coding, simulation, and synthesis practices) justified
the time spent.¹

The moderator must ensure that any defect or enhancement
is recorded by the scribe and that the inspection team agrees
to the severity assigned to each item. To keep the group on 
track, the moderator should not allow long discussions of 
the severity of any defect. Where no agreement can be 
reached, the moderator should assign a severity. If the as- 
signment of a severity code becomes a stumbling block to 
progress in several meetings, a simpler major/minor severity 
classification can be adopted as an operational definition.* 
The moderator should keep track of any best practices that
come up during the meeting that are not already part of the
operational definitions and note any questions raised about 
related design processes and tools. 

Rework 

After the meeting the scribe gives the defect log to the author
(and only to the author). It is the author's responsibility to
modify the HDL code as appropriate. If the author or the
moderator feels that the HDL should be reinspected, another
meeting can be scheduled (this should be very rare,
and should proceed with a different set of participants in all 
roles other than the author). 

Follow-up 

After each inspection the moderator should investigate any 
questions that were brought up about design processes and 
tools, such as simulation and synthesis. The results of the 
investigation along with any new best practices should be 
published for the design teams. The moderator and process 
manager should also review the operational definitions and 
update them. Finally, the process manager should update 
the overall inspection process statistics. 

HDL Issues 

When inspecting code written in a high-level software lan- 
guage, normally there is a single target compiler and plat- 
form. We found that a major difficulty with inspecting code 
written in a hardware description language was deciding on 
a target on which to focus. HDL is traditionally targeted to 
both simulation and synthesis (among a growing list of HDL 
source-level tools). We started by trying to inspect HDL 
without thinking in terms of a specific target. In theory, it 
might be possible to inspect HDL code as an abstract de- 
scription. In practice, it was nearly impossible. Both the 
expected simulation results and the actual implementation 
created by the synthesis process were always on the minds 
of the inspectors (see Fig. 3). Furthermore, operational defi- 
nitions such as defect severity are invariably developed and 
interpreted with reference to a target. 

By the time we had done several inspections it was evident 
that the most common practices being shared in the meet- 
ings were related to register transfer language (RTL) coding 
for synthesis. Since the synthesis tools are not as mature as 
either compilers for high-level languages in the software 
domain or simulators in the hardware domain, we spent a 
good deal of time discussing what structural elements the 
synthesis tools would create from the RTL-level HDL code 
given a set of constraints and synthesis options. This seemed 
natural given the complexity of the synthesis tools. At times 
the inspection meetings focused more on the synthesis tools 
than on the HDL. When writing HDL for synthesis the types 
of complexities involved are more akin to porting a complex 
piece of software between frameworks than between com- 
pilers. Therefore, it is inevitable that the inspection meet- 
ings devote a good deal of time to synthesis. As use of very 
complex source-level tools such as behavioral synthesis 
tools becomes widespread this effect will become more pro- 
nounced. In fact, one benefit of the HDL inspection process 
is to share information about the tools used during design 
creation. This happens less frequently in a traditional soft- 
ware inspection where the compiler is less configurable and 
better understood. But where the software is targeted at 
many platforms, this type of discussion occurs in the soft- 
ware domain as well. 

Fig. 3. HDL is targeted to both simulation and synthesis. 

Another difference we found between HDL code inspections 
and software inspections was that often there were questions 
that couldn't be satisfactorily answered during the inspection, 
such as "What will the synthesizer produce from the follow- 
ing code (e.g., mixed addition and subtraction of registers 
with differing widths)?" It was up to the moderator to follow 
up with the author (or another inspector) on questions that 
couldn't be answered in the meeting and to write up a re- 
sponse for the design team and for possible inclusion in the 
best practices guidelines. 

Table II indicates the kinds of topics that came up during the 
inspections and their approximate frequency. 



Table II 
HDL Inspection Topics 

Frequency   Topic 

35%         HDL coding style, standards, and guidelines 
            (e.g., when to use blocking and nonblocking 
            assignments) 
30%         Structures produced by synthesis tools (HDL 
            compiler, design compiler, finite state machine 
            compiler) 
10%         Differences between simulation results and 
            synthesis results 
10%         HDL efficiency considerations (e.g., inference 
            of unnecessary latches, use of extra clock 
            cycles) 
10%         HDL documentation 
5%          HDL block structure 

As more HDL inspections were performed, the number of 
experienced inspectors grew and the guidelines for creation 
of HDL for synthesis, which had been created by synthesis 
users in the lab, became widely disseminated and discussed. 
Again, one of the primary benefits of the HDL code inspec- 
tion process is the spread of best practices among the larger 
group of designers. 

Lessons 

As the use of HDL increased in our lab, we noted a need for 
tools to improve the quality of the HDL produced by the 
design teams. The lack of HDL source-level tools such as 
code complexity analyzers, lint (a syntax checker), and 
others led us to choose a less automated approach. Our first 
effort at improving the quality of HDL was to develop an 
HDL code inspection process based on the inspections done 
for software written in high-level languages. 

The process that evolved for inspecting HDL in our lab in- 
corporates elements of both a formal code inspection pro- 
cess and a structured walkthrough process. Although we 
gave importance to the technical overview meeting, it wasn't 
always held, especially if inspection team members were 
offsite. Furthermore, both the rework and the follow-up 
steps were left to the moderator and author and checked 
only informally by the process manager. 

Early in the adoption of the process we used a set of respon- 
sibility checklists for each role. As time went on we found 
that these were not strictly necessary but did engender a 
feeling of formality. It is important that the participants take 
the process seriously to ensure that the time spent on it is 
not wasted. 

Over time we came to realize the importance of the technical 
overview meeting. If it is impossible for the author to attend 
the meeting (we ran into several cases where the author was 
from another site and unable to attend a technical overview) 
then someone else on the inspection team should take the 
author's place for the meeting. In cases where we skipped 
the meeting, the preparation time for each participant in- 
creased dramatically. In one case the inspection required 6 
to 10 hours of preparation time. Though the code was fairly 
long at 900 lines of HDL, this was an unreasonable amount 
of time to expect from each reviewer and could have been 
reduced by half had there been a one-hour technical over- 
view held. 

In our experience, the most significant benefit of the HDL 
inspection process was to spread HDL, simulation, and syn- 
thesis best practices among the design teams. Not only did 
the process encourage interaction between various teams 
within ICBD, but several design teams in HP entities outside 
of ICBD brought code to us for inspection. To ensure that 
this benefit is realized it is very important that the process 
manager and the moderators take the time to publish the 
guidelines that are developed during each inspection. As 
designers become proficient at creating HDL and knowl- 
edgeable of synthesis and simulation best practices, and as 
HDL coding guidelines become well-established in an orga- 
nization, the need to do inspections to spread best practices 
decreases. 

We found relatively few major defects in the HDL code that 
was inspected, probably because the code was all at the RTL 
level and simulated and synthesized before inspection. Stud- 
ies have indicated that the inspection process gives the best 
results when applied at a high level of abstraction. I contend 
that we will find more defects if we apply the process to 
module specifications or to behavioral HDL. If the target 
chosen is complex (as behavioral synthesis tools currently 
are) the tendency for the process to focus on the tool 
instead of the code will also be more pronounced. Even so, 
applying the inspection process to higher-level abstractions 
may be a logical next step. Doolan wrote, "As people be- 
come aware of the tremendous benefits of the inspection 
process, there is an increasing desire to apply it to other 
software items, such as user documentation ... inspection 
breeds inspection." 2 

Summary 

Reference 3 describes one ICBD Corvallis project that used 
the HDL inspection process (however, inspections are not 
discussed in reference 3). 

The code inspection process can be applied successfully to 
hardware descriptions if the following conditions are met: 

• A simple set of operational definitions is developed for the 
  process. 
• Engineers are willing to open their code to inspection and 
  the process is viewed by the design community as beneficial 
  and important. 
• Management gives project teams adequate time to perform 
  inspections. 
• Best practices and guidelines are recorded and updated. 

For project teams just starting to use hardware description 
languages in the design process, code inspections can play a 
vital role in ensuring high-quality HDL. At ICBD Corvallis, 
we found that the inspection process works extremely effi- 
ciently in spreading best practices for HDL coding, simula- 
tion, and synthesis. 
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Overview of Code-Domain Power, 
Timing, and Phase Measurements 

Telecommunications Industry Association standards specify various 
measurements designed to ensure the compatibility of North American 
CDMA (code division multiple access) cellular transmitters and receivers. 
This paper is a tutorial overview of the operation of the measurement 
algorithms in the HP 83203B CDMA cellular adapter, which is designed to 
make the base station transmitter measurements specified in the 
standards. 

by Raymond A. Birgenheier 



In 1994, the Telecommunications Industry Association (TIA) 
released the IS-95 and IS-97 standards developed by the TIA 
TR-45.5 subcommittee. These standards ensure the mobile- 
station/base-station compatibility of a dual-mode wideband 
spread spectrum system, the North American CDMA (code 
division multiple access) cellular telephone system. 1 CDMA 
is a class of modulation that uses specialized codes to pro- 
vide multiple communication channels in a designated seg- 
ment of the electromagnetic spectrum. The TIA IS-95/97 
standards specify various measurements that must be made 
on CDMA base station and mobile station transmitters and 
receivers to ensure their compatibility. The HP 83203B 
CDMA cellular adapter for the HP 8921A Option 600 cell site 
test system is designed to make the base station transmitter 
measurements specified in the standards. The HP 83203B 
algorithms provide accurate measurements of code-domain 
power, time, frequency, and phase. This paper is a tutorial 
overview of the operation of the measurement algorithms in 
the HP 83203B. 

The HP 83203B measurement algorithms provide a charac- 
terization of the code-domain channels of a CDMA base 
station transmitter. One of the measurements, called code- 
domain power, provides the distribution of power in the 
code channels. This measurement can be used to verify that 
the various channels are at expected power levels and to 
determine when one code channel is leaking energy into the 
other code channels. The crosscoupling of code channels 
can occur for many reasons. One reason is a time misalign- 
ment of the channels, which would negate the orthogonal 
relationship among code channels. Another reason may be 
the impairment of the signals caused by nonideal or mal- 
functioning components in the transmitter. To determine the 
quality of the transmitter signal, a waveform quality factor, 
ρ, is measured. It is the amount of transmitter signal energy 
that correlates with an ideal reference signal when only the 
pilot channel is transmitted. 



Another set of measurements, called code-domain timing 
and code-domain phase, determine how well-aligned the 
code channels are in time and in phase. The parameters 
measured are time offsets and phase offsets of active code 
channels relative to the pilot channel (code channel 0). 

To make these measurements to the precision specified in 
the IS-97 standard, it is necessary to establish the time origin 
and the carrier frequency of the signal to be measured. The 
HP 83203B provides these measurements. Another mea- 
surement that may be useful when diagnosing the causes of 
poor transmitter signal quality is the carrier feedthrough in 
the transmitter signal. The effect of carrier feedthrough will 
also be seen when measuring code-domain power. 

This paper presents (1) the general concepts of CDMA sig- 
nals and measurements, (2) the signal flow of the measure- 
ment algorithms, (3) the specifications from the IS-97 stan- 
dard and performance predictions for the measurement 
algorithms based on mathematical modeling and simula- 
tions, and (4) some typical results of measurements made 
with the HP 83203B. 

CDMA Operation 

The channel structure for a CDMA base station transmitter 
is shown in Fig. 1. There are 64 code channels, correspond- 
ing to 64 Walsh functions, each 64 chips long.* To see how 
the Walsh functions provide the channelization, we will con- 
sider a hypothetical example of four code channels pro- 
duced by the four orthogonal Walsh functions shown in 
Fig. 2. The sums shown in Fig. 1 are modulo-2, as defined in 
Table I. They are appropriate when a 0,1 representation is 
used for binary numbers and are equivalent to ordinary mul- 
tiplication when a 1,-1 representation is used. The Walsh 
functions use nonreturn-to-zero (NRZ) values of 1 and -1 to 
represent binary numbers. 

* The chip interval is the clock period of the spreading code used in a spread-spectrum system. 
In this paper, a chip corresponds to one binary digit of the pilot pseudonoise sequences shown 
in Fig. 1. 

[Figure: the pilot channel (all zeros) and code channels 1 through 63 are each summed modulo-2 with their Walsh function (w_0, w_1, ..., w_63), then with the I-channel and Q-channel pilot pseudonoise sequences (1.2288 Mbits/s each), and passed to the I and Q transmit filters.] 

Fig. 1. Forward CDMA (base station transmitter) channel structure. 



Table I 
Modulo-2 Sum (XOR) 

⊕    0    1 

0    0    1 
1    1    0 



The Walsh functions are said to be orthogonal because the 
inner product of w_i(t) and w_j(t) is: 

$$ \int_0^4 w_i(t)\,w_j(t)\,dt = \begin{cases} 4, & i = j \\ 0, & i \neq j \end{cases} \tag{1} $$

that is, the inner product of two distinct Walsh functions is 
zero. 

The orthogonality property produces the channelization, as 
we can see by considering the transmission of a binary digit 
(bit) that is four chip intervals long on channel 1. If the bit 
is represented by ±1, then at the transmitter and, ideally, at 
the receiver the bit is represented by ±w_1(t). At the re- 
ceiver, an operation equivalent to equation 1 is performed 
on ±w_1(t)w_i(t) for each channel i = 0, 1, 2, 3. This opera- 
tion produces the result: 

$$ \pm\int_0^4 w_1(t)\,w_i(t)\,dt = \begin{cases} \pm 4, & i = 1 \\ 0, & i \neq 1 \end{cases} \tag{2} $$

Therefore, we see that the bit can be detected on channel 1, 
but it does not appear on channels 0, 2, or 3. 
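
The channelization argument can be checked numerically. The following Python sketch is illustrative only: it assumes the four 4-chip Walsh words in the 1,-1 representation, and the helper names `spread` and `despread` are hypothetical, not taken from the instrument.

```python
# Four 4-chip Walsh words in the 1,-1 (NRZ) representation.
WALSH = [
    [1, 1, 1, 1],    # w0
    [1, -1, 1, -1],  # w1
    [1, 1, -1, -1],  # w2
    [1, -1, -1, 1],  # w3
]


def spread(bit, channel):
    """Spread one data bit (+1 or -1) over four chips on the given channel."""
    return [bit * chip for chip in WALSH[channel]]


def despread(chips):
    """Form the inner product of the received chips with every Walsh word."""
    return [sum(c * w for c, w in zip(chips, word)) for word in WALSH]


# A bit of -1 sent on channel 1 shows up only in the channel 1 correlator.
print(despread(spread(-1, channel=1)))  # [0, -4, 0, 0]
```

The sign of the nonzero correlator output recovers the transmitted bit, and the zeros are the orthogonality of equation 1 at work.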



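The 64 Walsh functions used for the channelization shown in 
Fig. 1 are represented by 64-bit words that are rows (or col- 
umns) of a 64 × 64 Hadamard matrix. The Hadamard matrix 
is orthogonal (i.e., rows or columns are orthogonal) and can 
be generated by the following simple algorithm: 

• A 2 × 2 Hadamard matrix is defined as: 

$$ H_2 = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \tag{3} $$

• A 4 × 4 Hadamard matrix is generated as: 

$$ H_4 = \begin{bmatrix} H_2 & H_2 \\ H_2 & \bar{H}_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix} \tag{4} $$

• In general, a Hadamard matrix H_2n is generated from a 
  Hadamard matrix H_n by: 

$$ H_{2n} = \begin{bmatrix} H_n & H_n \\ H_n & \bar{H}_n \end{bmatrix} \tag{5} $$

Here the overbar denotes the element-by-element binary complement, as can be seen from the lower right block of equation 4. 

[Figure: the four Walsh functions, each taking the values 1 and -1, plotted against time t (chip intervals) from 0 to 4.] 

Fig. 2. Four orthogonal Walsh functions. 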
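
The recursion of equations 3 through 5 is short enough to sketch in code. This is a minimal illustration, assuming the 0,1 representation and that the lower right block is the binary complement; the function name `hadamard` is a hypothetical choice.

```python
def hadamard(n):
    """Return the n x n binary (0/1) Hadamard matrix, n a power of two >= 2."""
    h = [[0, 0], [0, 1]]  # H_2 from equation 3
    while len(h) < n:
        top = [row + row for row in h]                      # [H_n  H_n]
        bottom = [row + [1 - b for b in row] for row in h]  # [H_n  complement of H_n]
        h = top + bottom                                    # equation 5
    return h


for row in hadamard(4):
    print(row)

# The 64 rows of hadamard(64) are the 64-bit Walsh words.
walsh_words = hadamard(64)
print(len(walsh_words), len(walsh_words[0]))  # 64 64
```

Calling `hadamard(4)` reproduces the matrix of equation 4 row by row.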



The inner product of two rows of H_n is obtained by the mod- 
ulo-2 summing of the two rows, element by element, and 
counting the difference between the number of 0s and 1s, 
where the modulo-2 sum is the XOR operation defined in 
Table I. For example, to obtain the inner product of rows 1 
and 2 of H_2, we perform the following operation: 

      0  0 
  ⊕   0  1 
  -------- 
      0  1                                  (6a) 

Inner product = number of 0s 
minus number of 1s = 0 

If a 1,-1 representation is used for the binary numbers, then 
the inner product given by equation 6a is simply: 

      1  1 
  ×   1 -1 
  -------- 
      1 -1                                  (6b) 

Inner product = sum = 0. 
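
The equivalence of the two ways of forming the inner product can be verified directly. The sketch below is an illustration under the mapping used in this article (binary 0 maps to +1, binary 1 maps to -1); the function names are hypothetical.

```python
def inner_product_binary(row_a, row_b):
    """0,1 representation: XOR the rows, then count 0s minus 1s (equation 6a)."""
    xor = [a ^ b for a, b in zip(row_a, row_b)]
    return xor.count(0) - xor.count(1)


def inner_product_nrz(row_a, row_b):
    """1,-1 representation: multiply element by element and sum (equation 6b)."""
    return sum(a * b for a, b in zip(row_a, row_b))


def to_nrz(row):
    """Map binary 0 to +1 and binary 1 to -1."""
    return [1 - 2 * b for b in row]


H4 = [[0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 0]]

# Both forms agree for every pair of rows of the 4 x 4 Hadamard matrix.
for a in H4:
    for b in H4:
        assert inner_product_binary(a, b) == inner_product_nrz(to_nrz(a), to_nrz(b))

print(inner_product_binary(H4[1], H4[2]))  # 0 (distinct rows)
print(inner_product_binary(H4[1], H4[1]))  # 4 (a row with itself)
```

The two forms agree because an XOR result of 0 corresponds to a product of +1 and an XOR result of 1 corresponds to a product of -1.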



Fig. 3 shows an example of the pseudonoise encoding 
shown in Fig. 1 for code channel 1. The input bits, denoted 
by d_i, are added (modulo-2) to the Walsh function w_1 and 
then to the I-channel and Q-channel pseudonoise sequences 
i_pn and q_pn. The resulting modulo-2 sums are converted to 
±1 for I_k and Q_k, where +1 represents binary 0 and -1 repre- 
sents binary 1. The discrete time signals I_k and Q_k provide 
the inputs to the transmit filters. The outputs of these filters 
are the superposition of pulses centered at discrete times t_k, 
k = ..., 0, 1, 2, ..., as illustrated in Fig. 4. 



[Figure: input bits d_1, d_2, d_3 (each 64 chips long); Walsh function (w_1 shown); I-channel pseudonoise sequence (i_pn); Q-channel pseudonoise sequence (q_pn); and the resulting ±1 sequences I_k and Q_k.] 

Fig. 3. Pseudonoise encoding. 

[Figure: transmit filter output pulses for I_k = 1 and for I_k = -1, centered at time t_k and plotted against time t.] 

Fig. 4. Transmit filter output. 

If the pulse for I_k or Q_k equals zero when t = t_i, i ≠ k, then 
the pulses at the outputs of the transmit filters do not inter- 
fere with each other at discrete times t_k, k = ..., 0, 1, 2, ..., and 
we say the transmit filters introduce zero intersymbol inter- 
ference. 

The transmit filters illustrated in Fig. 4 introduce zero inter- 
symbol interference. However, the transmit filter specified 
in the IS-95 standard does introduce intersymbol interfer- 
ence. Moreover, the base station transmitter specified in the 
standard must incorporate an all-pass phase preequalizer, 
which produces an asymmetric transmitter pulse response. 

The reason for the I-Q structure shown in Fig. 1 will become 
clearer after we consider code-domain signals. 

Code-Domain Signals (Forward Link) 

Any sinusoidal carrier with amplitude and phase modulation 
can be written mathematically as: 

$$ X(t) = A(t)\cos[\omega_c t + \Phi(t)] \tag{7} $$

where ω_c = 2πf_c (f_c is the carrier frequency in Hz), A(t) is 
the instantaneous amplitude, and Φ(t) is the instantaneous 
phase. 

Using the trigonometric identity cos(θ+φ) = cosθcosφ - 
sinθsinφ, equation 7 can be rewritten as: 

$$ \begin{aligned} X(t) &= A(t)\cos\Phi(t)\cos\omega_c t - A(t)\sin\Phi(t)\sin\omega_c t \\ &= I(t)\cos\omega_c t - Q(t)\sin\omega_c t, \end{aligned} \tag{8} $$

where the in-phase component of the signal (the component 
multiplying the carrier cos ω_c t) is: 

$$ I(t) = A(t)\cos\Phi(t) \tag{9} $$

and the quadrature component (the component multiplying 
the quadrature carrier -sin ω_c t) is: 

$$ Q(t) = A(t)\sin\Phi(t). \tag{10} $$

Using Euler's identity, e^{jθ} = exp(jθ) = cosθ + j sinθ, we can 
write: 

$$ I(t) + jQ(t) = A(t)e^{j\Phi(t)}. \tag{11} $$

I(t) + jQ(t) is called the complex envelope of the modulated 
carrier and is represented as a rotating phasor as shown in 
Fig. 5. The tip of the rotating phasor moves as a function of 
time forming the locus referred to as the signal trajectory. 
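
As a quick numerical check of equations 7 through 11, the sketch below converts one amplitude and phase pair to I and Q and confirms that both forms of the carrier give the same sample. The numeric values and function names are arbitrary assumptions chosen for illustration.

```python
import cmath
import math


def to_iq(amplitude, phase):
    """Equations 9 and 10: I = A cos(phase), Q = A sin(phase)."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


def carrier_polar(amplitude, phase, wc, t):
    """Equation 7: A cos(wc*t + phase)."""
    return amplitude * math.cos(wc * t + phase)


def carrier_iq(i, q, wc, t):
    """Equation 8: I cos(wc*t) - Q sin(wc*t)."""
    return i * math.cos(wc * t) - q * math.sin(wc * t)


amplitude, phase = 1.5, math.radians(40.0)  # arbitrary example values
wc, t = 2 * math.pi * 1.0e3, 1.234e-4       # hypothetical carrier (rad/s) and time (s)

i, q = to_iq(amplitude, phase)
envelope = complex(i, q)                    # equation 11: I + jQ

# Magnitude and angle of the complex envelope recover A and the phase.
print(abs(envelope), math.degrees(cmath.phase(envelope)))
# The polar and I-Q forms of the modulated carrier agree.
print(carrier_polar(amplitude, phase, wc, t), carrier_iq(i, q, wc, t))
```

Working with the complex envelope rather than the carrier itself is what lets the later measurements be described entirely in the I-Q plane.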

The forward link of the CDMA system uses quadrature 
phase-shift keying (QPSK) modulation. First, we will con- 
sider the case in which only the pilot signal is present. In 
this case, if no intersymbol interference is introduced by the 
transmit filter, the signal trajectory passes through four dis- 
crete points separated by multiples of 90 degrees in the I-Q 
plane as shown in Fig. 6. These four points on the I-Q dia- 
gram are referred to as the signal constellation for the QPSK 
modulation. 

The coordinates of these points represent the four possible 
values of a pair of bits. As the signal moves along its trajec- 
tory, the coordinates at discrete time t_k represent the pair of 
bits transmitted at this time. The example signal trajectory 
presented in Fig. 6 is for the first eight pairs of bits of the 
pilot sequences with corresponding times t_k, as given in 
Table II. 



Table II 
First 8 Pairs of Bits of Pilot Sequences 

k       1    2    3    4    5    6    7    8 

i_pn   -1    1   -1    1   -1    1    1   -1 
q_pn   -1    1    1   -1   -1   -1   -1    1 



Now we will consider a case in which the pilot (code chan- 
nel 0) and code channel 1 are transmitted simultaneously. 
In this case, the transmitter signal can be represented as: 



$$ X(t) = A_0(t)\cos[\omega_c t + \Phi_0(t)] + A_1(t)\cos[\omega_c t + \Phi_1(t)], \tag{12} $$

where A_0(t) and Φ_0(t) represent the amplitude and phase 
modulation introduced by the pilot and A_1(t) and Φ_1(t) rep- 
resent the amplitude and phase modulation introduced by 
code channel 1. Using the trigonometric identity cos(θ+φ) = 
cosθcosφ - sinθsinφ, we can write equation 12 as: 

$$ \begin{aligned} X(t) &= [A_0(t)\cos\Phi_0(t) + A_1(t)\cos\Phi_1(t)]\cos(\omega_c t) \\ &\quad - [A_0(t)\sin\Phi_0(t) + A_1(t)\sin\Phi_1(t)]\sin(\omega_c t) \\ &= I(t)\cos(\omega_c t) - Q(t)\sin(\omega_c t), \end{aligned} \tag{13} $$




Fig. 5. The complex envelope of the modulated carrier is repre- 
sented as a rotating phasor. The locus of the tip of the phasor is 
called the signal trajectory. 




Fig. 6. Example of a signal constellation (points) and a signal 
trajectory. 

where 

$$ I(t) = A_0(t)\cos\Phi_0(t) + A_1(t)\cos\Phi_1(t) \tag{14} $$

and 

$$ Q(t) = A_0(t)\sin\Phi_0(t) + A_1(t)\sin\Phi_1(t). \tag{15} $$

From equations 14 and 15, it is clear that since 

$$ I(t) = I_0(t) + I_1(t) \quad\text{and}\quad Q(t) = Q_0(t) + Q_1(t), \tag{16} $$

I(t) and Q(t) are simply the superposition of the correspond- 
ing components produced by the pilot and code channel 1. 
Therefore, we can superimpose I-Q diagrams. 

To simplify the description at this point, we will consider the 
code channels produced by four orthogonal Walsh words 
each four chips long, as shown in Table III. 





Table III 
Orthogonal Walsh Words 

w_0:    1    1    1    1 
w_1:    1   -1    1   -1 
w_2:    1    1   -1   -1 
w_3:    1   -1   -1    1 



For illustrative purposes, we will assume that the peak mag- 
nitude √2·a_0 = |A_0(t_k)|_peak of the pilot (code channel 0) is 
0.8√2 and the magnitude √2·a_1 = |A_1(t_k)|_peak of the signal 
for code channel 1 is 0.6√2, so that the root-sum-square of 
the pilot and code channel 1 signals is: 

$$ \sqrt{0.8^2 + 0.6^2} = 1.0. \tag{17} $$

In this case, the pilot signal has the trajectory shown in 
Fig. 6, except that the signal coordinates are (±0.8, ±0.8) 
instead of (±1, ±1). 

To determine the trajectory produced by code channel 1, we 
must consider multiplying Walsh word w_1 by data bits. For 
our example, we will assume data bits for two Walsh func- 
tion intervals: d = 1, -1. We obtain values for I_1 and Q_1 as 
presented in Table IV. 

Table IV 
Calculation of I_1 and Q_1 

Time                      t1    t2    t3    t4    t5    t6    t7    t8    t9 

i_pn                      -1     1    -1     1    -1     1     1    -1 
q_pn                      -1     1     1    -1    -1    -1    -1     1 
w_1                        1    -1     1    -1     1    -1     1    -1 
w_1 i_pn                  -1    -1    -1    -1    -1    -1     1     1 
w_1 q_pn                  -1    -1     1     1    -1     1    -1    -1 
d_1                        1     1     1     1    -1    -1    -1    -1 
d_1 w_1 i_pn              -1    -1    -1    -1     1     1    -1    -1 
d_1 w_1 q_pn              -1    -1     1     1     1    -1     1     1 
a_1                      0.6   ... 
I_1 = a_1 d_1 w_1 i_pn  -0.6  -0.6  -0.6  -0.6   0.6   0.6  -0.6  -0.6  -0.6 
Q_1 = a_1 d_1 w_1 q_pn  -0.6  -0.6   0.6   0.6   0.6  -0.6   0.6   0.6  -0.6 

First, the i_pn and q_pn sequences are multiplied by Walsh 
word w_1 = (1 -1 1 -1) repeated every 4 chips. This result 
is then multiplied by the data sequence d_1 = 1 for the first 4 
chips and d_2 = -1 for the next 4 chips, and finally, the two 
sequences are multiplied by the amplitude a_1 = 0.6. Values of 
-0.6 were arbitrarily added for time t_9 to be used later to 
illustrate the effect of time offset. The resulting sequences 
for I_0, Q_0 and I_1, Q_1 are shown in Table V and their I-Q dia- 
grams are shown in Fig. 7. 

Table V 
Superposition of I-Q Sequences 

Time    t1    t2    t3    t4    t5    t6    t7    t8 

I_0   -0.8   0.8  -0.8   0.8  -0.8   0.8   0.8  -0.8 
Q_0   -0.8   0.8   0.8  -0.8  -0.8  -0.8  -0.8   0.8 
I_1   -0.6  -0.6  -0.6  -0.6   0.6   0.6  -0.6  -0.6 
Q_1   -0.6  -0.6   0.6   0.6   0.6  -0.6   0.6   0.6 
I     -1.4   0.2  -1.4   0.2  -0.2   1.4   0.2  -1.4 
Q     -1.4   0.2   1.4  -0.2  -0.2  -1.4  -0.2   1.4 
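
The table entries follow mechanically from the pilot sequences, the Walsh word, the data bits, and the amplitudes. The sketch below recomputes them; the function name `channel_iq` is hypothetical, and the pilot is modeled as Walsh word w_0 with data of all +1 (binary 0), which is an assumption consistent with the all-zeros pilot channel of Fig. 1.

```python
I_PN = [-1, 1, -1, 1, -1, 1, 1, -1]    # first eight chips of i_pn
Q_PN = [-1, 1, 1, -1, -1, -1, -1, 1]   # first eight chips of q_pn
W0 = [1, 1, 1, 1]
W1 = [1, -1, 1, -1]


def channel_iq(amplitude, walsh, data, i_pn, q_pn):
    """Return (I, Q) chip sequences for one code channel.

    Each data bit lasts one Walsh word (len(walsh) chips).
    """
    i_seq, q_seq = [], []
    for k, (i_chip, q_chip) in enumerate(zip(i_pn, q_pn)):
        w = walsh[k % len(walsh)]
        d = data[k // len(walsh)]
        i_seq.append(amplitude * d * w * i_chip)
        q_seq.append(amplitude * d * w * q_chip)
    return i_seq, q_seq


i0, q0 = channel_iq(0.8, W0, [1, 1], I_PN, Q_PN)    # pilot, a_0 = 0.8
i1, q1 = channel_iq(0.6, W1, [1, -1], I_PN, Q_PN)   # code channel 1, a_1 = 0.6

# Superposition (equation 16), rounded to suppress floating-point noise.
i_sum = [round(a + b, 1) for a, b in zip(i0, i1)]
q_sum = [round(a + b, 1) for a, b in zip(q0, q1)]
print(i_sum)
print(q_sum)
```

The printed sequences correspond to the I and Q rows of the table for times t1 through t8.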



In the above example, we considered the situation of a 
CDMA signal consisting of the pilot and code channel 1 and 
showed that we could obtain the I-Q diagram for the com- 
posite signal simply by superimposing the I-Q diagrams for 
the individual signals. For our example of two signals, the 
two 4-point I-Q diagrams produced an 8-point diagram for 
the composite signal. This principle of superposition can be 
applied to any number of code channels and provides a con- 
venient geometric way of constructing and visualizing sig- 
nals. For example, if we consider three code channels with 
signal amplitudes of a_0, a_1, and a_2, then we obtain an I-Q 
diagram with coordinates (x,y) in which x and y take on the 
eight values ±a_0±a_1±a_2 to produce a signal constellation with 
16 points. We must keep in mind that the above discussion 
applies only for the condition of zero intersymbol inter- 
ference. 



[Figure: three I-Q diagrams. (a) Pilot channel, with constellation points (±0.8, ±0.8). (b) Code channel 1, with constellation points (±0.6, ±0.6). (c) Composite signal, with labeled points including (-1.4, 1.4) and (1.4, 1.4). Trajectory points are labeled with their times t_k.] 

Fig. 7. Signal constellation and trajectory for (a) pilot channel, 
(b) code channel 1, and (c) the sum of the pilot channel and 
code channel 1. 
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Signal Acquisition (Timing and Frequency Estimation) 

To perform the measurements of the CDMA signals, it is 
necessary to estimate the precise carrier frequency so that 
the signal to be measured can be converted to baseband, 
that is, so it can be represented in terms of an I-Q signal tra- 
jectory as discussed above. Furthermore, it is necessary to 
determine the timing of the signal to be measured relative to 
the zero time reference of the pseudonoise sequences i_pn 
and q_pn, which are used to spread the spectrum of the trans- 
mitter signal. The estimation of timing and carrier fre- 
quency are discussed in this section. 

Suppose that the transmitter signal to be measured has an 
unknown frequency error Δω, unknown phase θ_0, and an 
unknown time delay τ_0, so that after down-conversion to 
baseband, the signal available for measurement can be rep- 
resented in the form of equation 7 with ω_c replaced with 
ω_c+Δω, t replaced with t - τ_0, and a phase term θ_0 added. 
That is, the signal to be measured can be represented as: 

$$ X(t-\tau_0) = A(t-\tau_0)\cos[(\omega_c+\Delta\omega)(t-\tau_0) + \Phi(t-\tau_0) + \theta_0], \tag{18} $$

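which can be written, using the trigonometric identity 
cos(θ+φ) = cosθcosφ - sinθsinφ, as: 

$$ \begin{aligned} X(t-\tau_0) &= A(t-\tau_0)\cos[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\cos\omega_c t \\ &\quad - A(t-\tau_0)\sin[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\sin\omega_c t. \end{aligned} \tag{19} $$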

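From equation 19, we obtain the in-phase and quadrature 
components as: 

$$ I_X(t) = A(t-\tau_0)\cos[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0] \tag{20} $$

and 

$$ Q_X(t) = A(t-\tau_0)\sin[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]. \tag{21} $$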



Using Euler's identity, e^{jθ} = exp(jθ) = cosθ + j sinθ, we can 
write the complex envelope as: 

$$ \begin{aligned} Y(t) &= I_X(t) + jQ_X(t) \\ &= A(t-\tau_0)\exp\{j[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\}, \end{aligned} \tag{22} $$

from which we see that the baseband signal is a rotating 
phasor with magnitude A(t-τ_0) and phase [Δωt - (ω_c+Δω)τ_0 
+ Φ(t-τ_0) + θ_0] as shown in Fig. 8. 

We see that if τ_0 ≠ 0 but Δω = 0, then the amplitude A(t-τ_0) 
and phase Φ(t-τ_0) are delayed versions of A(t) and Φ(t) and 
a phase shift of -ω_c τ_0 + θ_0 is added. Therefore, the effect of 
the time delay is simply a rotation of the I-Q diagram by an 
angle of -ω_c τ_0 + θ_0 and a change of τ_0 in the times at which 
the signal trajectory passes through the constellation points. 






When Δω ≠ 0, the frequency error adds an additional phase 
shift of -Δωτ_0 and a constant-rate phase rotation of Δωt. 
The result of the constant-rate phase rotation will, in gen- 
eral, be that the signal trajectory will no longer pass through 
discrete points, so the I-Q diagram will not resemble its 
counterpart for zero frequency error. 
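
Both effects can be seen numerically with a toy signal. The sketch below evaluates the baseband phasor for a signal parked on a single constellation point; all numeric values and the function name `baseband_phasor` are assumptions made for illustration.

```python
import cmath
import math


def baseband_phasor(t, amplitude, phase, wc, dw, tau0, theta0):
    """Baseband complex envelope; amplitude and phase are functions of time."""
    angle = dw * t - (wc + dw) * tau0 + phase(t - tau0) + theta0
    return amplitude(t - tau0) * cmath.exp(1j * angle)


# Hypothetical signal: constant amplitude, sitting on the 45-degree point.
amp = lambda t: 1.0
phi = lambda t: math.pi / 4
wc, tau0, theta0 = 2 * math.pi * 1.0e3, 2.0e-4, 0.3

# No frequency error: the point is rotated by -wc*tau0 + theta0 at every t.
ideal = cmath.exp(1j * math.pi / 4)
rotated = baseband_phasor(0.5, amp, phi, wc, 0.0, tau0, theta0)
rotation = cmath.phase(rotated / ideal)
print(rotation, -wc * tau0 + theta0)

# With a frequency error, the phase keeps advancing at dw radians per second.
dw = 2 * math.pi * 5.0
y1 = baseband_phasor(0.010, amp, phi, wc, dw, tau0, theta0)
y2 = baseband_phasor(0.011, amp, phi, wc, dw, tau0, theta0)
advance = cmath.phase(y2 / y1)
print(advance, dw * 0.001)
```

In the second case the phasor never settles on a fixed set of points, which is why the frequency error must be estimated and removed before the constellation becomes recognizable.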

The functions used to estimate τ_0, Δω, and θ_0 can be de- 
scribed by considering a pilot reference signal given as: 

$$ S(t-\tau_R, \omega_R) = A_0(t-\tau_R)\exp\{j[\omega_R t + \Phi_0(t-\tau_R)]\}, \tag{23} $$

in which A_0(t) and Φ_0(t) are the instantaneous amplitude 
and phase of the complex envelope corresponding to the 
pilot only, τ_R is a variable time delay, and ω_R is a variable 
frequency. Using the observable baseband signal Y(t) given 
by equation 22 and the reference signal given by equation 
23, the correlation function for these two signals is: 

$$ P(\tau_R, \omega_R) = \sum_k Y(t_k)\,S^*(t_k-\tau_R, \omega_R). \tag{24} $$



The sample interval t_k - t_{k-1} used here is different from that 
used previously and, in general, would be a fraction of the 
chip interval. The magnitude of P(τ_R, ω_R) could be maxi- 
mized with respect to τ_R and ω_R to determine the estimates 
τ̂_0 and Δω̂ of τ_0 and Δω. However, a normalized version of 
the squared magnitude of this function is used to facilitate 
the search strategy for finding τ̂_0. τ̂_0 is found by forming the 
function 

$$ \frac{|P(\tau_R, 0)|^2}{\sum_k |S(t_k-\tau_R, 0)|^2 \,\sum_k |Y(t_k)|^2} \tag{25} $$

and finding the value τ_R = τ̂_0 for which this function is maxi- 
mum. 

Maximizing equation 25 corresponds to maximizing the cor- 
relation between the observable baseband signal and an 
ideal reference signal for the pilot only. Usually, the observ- 
able baseband signal will consist of the superposition of a 
number of code channels. However, since the correlation 
between the pilot and the other code channels is small, the 
maximization of equation 25 provides a good initial estimate 
of τ_0. 
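
A toy version of this timing search is sketched below. It is not the instrument's algorithm: it assumes one sample per chip, integer delays, circular shifts, zero frequency error, and a stand-in pseudonoise sequence drawn from Python's `random` module rather than the IS-95 pilot sequences. All function names are hypothetical.

```python
import random


def pn_pair(n, seed):
    """Stand-in pilot chips i_pn + j*q_pn with random +/-1 components."""
    rng = random.Random(seed)
    return [complex(1 if rng.random() < 0.5 else -1,
                    1 if rng.random() < 0.5 else -1) for _ in range(n)]


def delayed(seq, delay):
    """Circularly delay a sequence by an integer number of samples."""
    n = len(seq)
    return [seq[(k - delay) % n] for k in range(n)]


def normalized_correlation(y, s):
    """|P|^2 divided by the product of the reference and signal energies."""
    p = sum(yk * sk.conjugate() for yk, sk in zip(y, s))
    energy = sum(abs(sk) ** 2 for sk in s) * sum(abs(yk) ** 2 for yk in y)
    return abs(p) ** 2 / energy


def estimate_delay(y, pilot, max_delay):
    """Return the trial delay that maximizes the normalized correlation."""
    return max(range(max_delay + 1),
               key=lambda d: normalized_correlation(y, delayed(pilot, d)))


N, TRUE_DELAY = 256, 5
pn = pn_pair(N, seed=1)
walsh1 = [1, -1, 1, -1]
pilot = [0.8 * c for c in pn]                                # code channel 0
code1 = [0.6 * walsh1[k % 4] * c for k, c in enumerate(pn)]  # code channel 1
observed = delayed([p + c for p, c in zip(pilot, code1)], TRUE_DELAY)

estimated = estimate_delay(observed, pn, max_delay=16)
print(estimated)
```

Because code channel 1 is orthogonal to the pilot over each Walsh word, its presence lowers the peak value of the normalized correlation but does not move the peak.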

P(τ_R, 0) is sensitive to frequency error Δω, which limits the 
range of Δω for which equation 25 can be used. We can ob- 
tain an expression for the frequency response of P(τ_0, 0) by 
setting 

$$ Y(t) = S(t-\tau_0, \Delta\omega) \tag{26} $$

to obtain 

$$ P(\tau_0, 0) = \sum_k A_0^2(t_k-\tau_0)\,e^{j\Delta\omega t_k}. \tag{27} $$



Fig. 8. Complex envelope of the baseband signal. 

To simplify the evaluation of this expression, consider sam- 
pling at points for which the signal trajectory passes through 
the constellation points of the pilot, so that A_0^2(t_k-τ_0) is con- 
stant. In this case, the magnitude of P(τ_0, 0) is: 

$$ |P(\tau_0, 0)| = A_0^2 \left| \frac{\sin(\Delta\omega T/2)}{\sin(\Delta\omega T/2K)} \right|, \tag{28} $$

where T is the length of the data record used to calculate 
P(τ_0, 0) and K is the number of samples in the data record. 

From the sketch of P(τ_0, 0) in Fig. 9, we see that P(τ_0, 0) = 0 
for Δω = 2π/T. In devising the search strategy for finding τ̂_0, 
it was assumed that frequency errors would be less than 
±π/T. Therefore, reliable estimates of τ_0 can be obtained 
only if 

$$ |\Delta\omega| < \frac{\pi}{T}. \tag{29} $$


After the value of $\hat{\tau}_0$ is determined, we obtain an estimate,
$\Delta\hat{\omega}$, of $\Delta\omega$ from the discriminator formed as the ratio of the
difference over the sum of $|P(\hat{\tau}_0, \Delta\omega_0)|$ and $|P(\hat{\tau}_0, -\Delta\omega_0)|$:

$$\Delta\hat{\omega} = \frac{\pi}{T} \cdot \frac{|P(\hat{\tau}_0, \Delta\omega_0)| - |P(\hat{\tau}_0, -\Delta\omega_0)|}{|P(\hat{\tau}_0, \Delta\omega_0)| + |P(\hat{\tau}_0, -\Delta\omega_0)|} \qquad (30)$$

where $\Delta\omega_0 = \pi/T$. The formation of this discriminator is
illustrated in Fig. 10, where $|P(\hat{\tau}_0, \Delta\omega_0)|$ is shown by the upper
dashed curve, $-|P(\hat{\tau}_0, -\Delta\omega_0)|$ is shown by the lower dashed
curve, and the discriminator curve, $\Delta\hat{\omega}T/\pi$, is shown by the
solid curve.

The function given by equation 30 is a linear function of $\Delta\omega$
for $|\Delta\omega| < \pi/T$ and provides a reasonably good initial
estimate of the frequency error when a significant percentage
(on the order of 10% or more) of the total transmitter power
is contained in the pilot channel.
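The linearity of the discriminator can be checked with a short numerical sketch. The code below is illustrative only: it assumes a constant-amplitude reference, uniformly spaced samples over a record of length T, and a pilot-only signal with a pure frequency offset. The function names `correlation` and `discriminator` are hypothetical.

```python
import numpy as np

def correlation(y, s, t, omega_r):
    """Correlate y with the reference s shifted in frequency by omega_r."""
    return np.sum(y * np.conj(s * np.exp(1j * omega_r * t)))

def discriminator(y, s, t, T):
    """Frequency-error estimate (rad/s) from the difference-over-sum ratio."""
    dw0 = np.pi / T
    p_plus = abs(correlation(y, s, t, dw0))
    p_minus = abs(correlation(y, s, t, -dw0))
    return dw0 * (p_plus - p_minus) / (p_plus + p_minus)

T = 1.25e-3                      # record length in seconds
K = 4096                         # number of samples (assumed value)
t = np.arange(K) * T / K
s = np.ones(K, dtype=complex)    # constant-amplitude reference (assumption)

df_true = 150.0                  # frequency error in Hz, inside +/- 1/(2T)
y = s * np.exp(1j * 2 * np.pi * df_true * t)

df_hat = discriminator(y, s, t, T) / (2 * np.pi)
print(round(df_hat, 3))
```

Under these assumptions the estimate reproduces the applied offset almost exactly. With other code channels present, the estimate would only be an initial value to be refined later.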

An estimate of the transmitter phase is obtained from the
phase of the correlation function with $\tau_R = \hat{\tau}_0$ and $\omega_R = \Delta\hat{\omega}$:

$$\hat{\theta}_0 = \tan^{-1} \frac{\Im[P(\hat{\tau}_0, \Delta\hat{\omega})]}{\Re[P(\hat{\tau}_0, \Delta\hat{\omega})]} \qquad (31)$$

where $\Re[z]$ and $\Im[z]$ are the real and imaginary parts of z,
respectively.

Fig. 10. Formation of the discriminator of equation 30. $|P(\hat{\tau}_0, \Delta\omega_0)|$ is
shown by the upper dashed curve, $-|P(\hat{\tau}_0, -\Delta\omega_0)|$ is shown by the
lower dashed curve, and the discriminator curve, $\Delta\hat{\omega}T/\pi$, is shown
by the solid curve.

Because of the weak correlation between the pilot channel
and the other code channels, equations 25, 30, and 31 provide
good initial estimates of $\tau_0$, $\Delta\omega$, and $\theta_0$. The estimates
of these parameters are refined after the intersymbol
interference has been removed by the complementary filter
discussed later in this article. Further refinement of these
parameters is achieved when estimating time and phase
offsets of the code channels relative to the pilot channel.
The estimation of the offset parameters is discussed later in
this article.

Code-Domain Power Spectrum 

The code-domain power spectrum is given in terms of the
coefficients $\rho_i$, where $\rho_i$ is defined as the fractional part of
the transmitter power contained in the ith code channel.
The first step in calculating the code-domain power spectrum
is to multiply $I(t_k)$ and $Q(t_k)$ by $i_{pn}$ and $q_{pn}$. The results
of these calculations are shown in Table VI.




Fig. 9. The correlation function $|P(\tau_0, 0)|$ as a function of frequency
error.



Table VI
Despreading of $I_k$ and $Q_k$

                              Time   I(t_k)   Q(t_k)   i_pn   q_pn   Z_I = I i_pn   Z_Q = Q q_pn
1st Walsh function interval   t1     -1.4     -1.4     -1     -1     1.4            1.4
                              t2      0.2      0.2      1      1     0.2            0.2
                              t3     -1.4      1.4     -1      1     1.4            1.4
                              t4      0.2     -0.2      1     -1     0.2            0.2
2nd Walsh function interval   t5     -0.2     -0.2     -1     -1     0.2            0.2
                              t6      1.4     -1.4      1     -1     1.4            1.4
                              t7      0.2     -0.2      1     -1     0.2            0.2
                              t8     -1.4      1.4     -1      1     1.4            1.4

The code-domain power spectrum is:

$$\rho_i = \frac{\sum_{h=1}^{N} \left| \sum_{k=1}^{M} Z_{hk} R^*_{ik} \right|^2}{\sum_{k=1}^{M} |R_{ik}|^2 \; \sum_{h=1}^{N} \sum_{k=1}^{M} |Z_{hk}|^2} \qquad (32)$$



where $Z_{hk}$ is the kth sample of the despread signal in the hth
Walsh function interval, $R_{ik}$ is the kth chip of the ith Walsh
function, M is the number of chips in a Walsh function, and
N is the number of Walsh function intervals in the measurement
interval. The calculations of $\rho_i$, i = 0, 1, 2, 3 for the
above example are presented in Table VII ($j = \sqrt{-1}$).

Table VII
Calculation of $\rho_i$ for the Example

        k   Z_hk        R_0k   Z_hk R*_0k   R_1k    Z_hk R*_1k
h = 1   1   1.4+j1.4    1+j    2.8          1+j      2.8
        2   0.2+j0.2    1+j    0.4          -1-j    -0.4
        3   1.4+j1.4    1+j    2.8          1+j      2.8
        4   0.2+j0.2    1+j    0.4          -1-j    -0.4
h = 2   1   0.2+j0.2    1+j    0.4          1+j      0.4
        2   1.4+j1.4    1+j    2.8          -1-j    -2.8
        3   0.2+j0.2    1+j    0.4          1+j      0.4
        4   1.4+j1.4    1+j    2.8          -1-j    -2.8

$$\sum_{k=1}^{4} |R_{ik}|^2 = 8 \qquad (33)$$

$$\sum_{h=1}^{2} \sum_{k=1}^{4} |Z_{hk}|^2 = 8(1.4^2) + 8(0.2^2) = 16 \qquad (34)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{0k} \right|^2 = 6.4^2 + 6.4^2 = 81.92 \qquad (35)$$

$$\rho_0 = \frac{81.92}{8(16)} = \frac{81.92}{128} = 0.64 \qquad (36)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{1k} \right|^2 = 4.8^2 + 4.8^2 = 46.08 \qquad (37)$$

$$\rho_1 = \frac{46.08}{128} = 0.36 \qquad (38)$$

$$\rho_2 = \rho_3 = 0$$

$$\rho_0 + \rho_1 + \rho_2 + \rho_3 = 1.0000. \qquad (39)$$



Since we selected signal amplitudes $a_0 = 0.8$ and $a_1 = 0.6$, the
total signal energy in our measurement interval (two Walsh
function intervals) is proportional to $(0.8^2 + 0.6^2) = 1.0$ and
the fractions of signal energy in the pilot and code channel 1,
respectively, are $0.8^2 = 0.64$ and $0.6^2 = 0.36$. We see,
therefore, that the results of this example verify that $\rho_i$ is the
fractional part of the energy of the observed signal that is
contained in the ith code channel.
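The arithmetic of this example is easy to reproduce. The sketch below is an illustration in Python with NumPy, not code from the instrument. It evaluates equation 32 for the despread samples of Table VII using the four length-4 Walsh functions scaled by (1 + j). The function name `code_domain_power` is hypothetical.

```python
import numpy as np

# Despread samples Z_hk from Table VII: N = 2 Walsh intervals, M = 4 chips each.
Z = np.array([[1.4, 0.2, 1.4, 0.2],
              [0.2, 1.4, 0.2, 1.4]]) * (1 + 1j)

# Length-4 Walsh functions; R_ik = walsh[i, k] * (1 + j) as in the example.
walsh = np.array([[1,  1,  1,  1],
                  [1, -1,  1, -1],
                  [1,  1, -1, -1],
                  [1, -1, -1,  1]])
R = walsh * (1 + 1j)

def code_domain_power(Z, R):
    """Evaluate equation 32 for every code channel i."""
    proj = Z @ R.conj().T                        # proj[h, i] = sum_k Z_hk R*_ik
    num = np.sum(np.abs(proj) ** 2, axis=0)      # sum over Walsh intervals h
    den = np.sum(np.abs(R) ** 2, axis=1) * np.sum(np.abs(Z) ** 2)
    return num / den

rho = code_domain_power(Z, R)
print(np.round(rho, 4))
```

The four coefficients come out as 0.64, 0.36, 0, and 0 to within floating-point rounding, in agreement with equations 36 through 39.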



Errors 

Various errors will produce a transmitter signal that does
not match the ideal reference signal. These errors will manifest
themselves as a distribution of the transmitter signal
energy among the code channels that varies from the ideal
distribution. As mentioned earlier, the transmitter signal
may have an unknown time reference and carrier frequency.
However, as we saw, these parameters are estimated so that
they can be removed from the signal to be measured. Therefore,
frequency errors and time delay are compensated to a
sufficient degree of accuracy to have minimal influence on
the distribution of code-domain power.

Other types of errors are not compensated. These include
signal impairments caused by nonideal components in the
transmitter such as nonideal filters, nonlinearities, gain and
phase imbalances, mixer spurs, quantization errors, and
others.

Waveform Quality Factor ($\rho$). A measure of the quality of the
transmitter signal is obtained by measuring $\rho$, defined as:

$$\rho = \frac{\left| \sum_k Z_k R^*_{0k} \right|^2}{\sum_k |R_{0k}|^2 \; \sum_k |Z_k|^2} \qquad (40)$$

where $Z_k$ is the kth sample of the despread signal, $R^*_{0k} = 1 - j$,
and only the pilot is transmitted. By comparing equations
40 and 32, we see that $\rho$ and $\rho_0$ are similar but not
identical. When $\rho_0$ is calculated, the energy in code channel
0 is found for each Walsh function interval in the measurement
interval and the sum of these energies is obtained.
When $\rho$ is calculated, the energy of the projection onto
$R^*_{0k} = 1 - j$ over the entire measurement interval is obtained. For
random type errors, values obtained for $\rho$ and $\rho_0$ will be
essentially equal. However, certain types of errors such as
uncompensated frequency errors will yield different values
for $\rho$ and $\rho_0$.

According to equations 32 and 40, a fixed phase difference
between the measured baseband signal and the reference
signal will not affect $\rho$ and $\rho_i$. This is true because these
functions involve the calculation of energies that are insensitive
to phase, that is,

$$\left| e^{j\theta_0} Z R^* \right|^2 = \left| Z R^* \right|^2.$$
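The phase insensitivity can be confirmed with a few lines of Python. This is a sketch under stated assumptions: the sample values in `z` are made up for illustration, the reference is the constant 1 + j of the pilot-only case, and `waveform_quality` is a hypothetical helper that evaluates the ratio of equation 40.

```python
import numpy as np

def waveform_quality(z, r):
    """Ratio of equation 40: |sum_k Z_k R*_k|^2 / (sum|R_k|^2 * sum|Z_k|^2)."""
    num = np.abs(np.vdot(r, z)) ** 2          # np.vdot conjugates its first argument
    den = np.vdot(r, r).real * np.vdot(z, z).real
    return num / den

r = np.full(8, 1 + 1j)                        # pilot-only reference
# Illustrative despread samples with small amplitude errors (made-up values).
z = np.array([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]) * (1 + 1j)

rho = waveform_quality(z, r)
rho_rotated = waveform_quality(z * np.exp(1j * 0.3), r)   # fixed 0.3-rad phase offset
print(rho, rho_rotated)
```

The two printed values agree to within rounding, because the common factor $e^{j\theta_0}$ has unit magnitude and drops out of both the numerator and the denominator.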

Time and Phase Offset Errors. Time offsets and phase offsets
of the code channels relative to the pilot channel are errors
with tolerances specified in IS-97. Offset errors in a particular
code channel will cause energy from that code channel
to leak into other code channels and thereby cause a change
in the distribution of code-domain power. An example of
time and phase offset errors is considered in this section.

Suppose there are time and phase offsets of channel 1 with
respect to channel 0 of $\Delta\tau_1$ and $\Delta\theta_1$, respectively. For
illustrative purposes, we will assume that the pulse response of
the transmit filter is triangular, as shown in Fig. 11, so the
transmit filter is considered a linear interpolator of adjacent
input values. We will extend our example by considering the
effects of offsets of $\Delta\tau_1 = 0.1T_c$, where $T_c$ is the chip interval,
and $\Delta\theta_1 = 0.1$ radian. We compute $I_1$ and $Q_1$ for this case
as presented in Table VIII.






Table VIII
Calculation of $\rho_i$ for the Example with Time and Phase Offsets

From timing error (linearly interpolate 90% current value, 10% future value)

Time   t1        t2        t3        t4        t5        t6        t7        t8
I_1    -0.6      -0.6      -0.6      -0.48      0.6       0.48     -0.6      -0.6
Q_1    -0.6      -0.48      0.6       0.6       0.48     -0.48      0.6       0.48

From phase error (I_1 cos 0.1 - Q_1 sin 0.1, I_1 sin 0.1 + Q_1 cos 0.1)

I_1    -0.5371   -0.5491   -0.6569   -0.5375    0.5491    0.5255   -0.6569   -0.6449
Q_1    -0.6569   -0.5375    0.5371    0.5491    0.5375   -0.4297    0.5371    0.4177
I_0    -0.8       0.8      -0.8       0.8      -0.8       0.8       0.8      -0.8
Q_0    -0.8       0.8       0.8      -0.8      -0.8      -0.8      -0.8       0.8
I      -1.3371    0.2509   -1.4569    0.2625   -0.2509    1.3255    0.1431   -1.4449
Q      -1.4569    0.2625    1.3371   -0.2509   -0.2625   -1.2297   -0.2629    1.2177

Multiply by i_pn and q_pn to obtain Z = Z_I + jZ_Q

Z_I     1.3371    0.2509    1.4569    0.2625    0.2509    1.3255    0.1431    1.4449
Z_Q     1.4569    0.2625    1.3371    0.2509    0.2625    1.2297    0.2629    1.2177

        k   Z_hk             R_0k   Z_hk R*_0k         R_1k    Z_hk R*_1k
h = 1   1   1.3371+j1.4569   1+j    2.7940+j0.1198     1+j      2.7940+j0.1198
        2   0.2509+j0.2625   1+j    0.5134+j0.0116     -1-j    -0.5134-j0.0116
        3   1.4569+j1.3371   1+j    2.7940-j0.1198     1+j      2.7940-j0.1198
        4   0.2625+j0.2509   1+j    0.5134-j0.0116     -1-j    -0.5134+j0.0116
h = 2   1   0.2509+j0.2625   1+j    0.5134+j0.0116     1+j      0.5134+j0.0116
        2   1.3255+j1.2297   1+j    2.5552-j0.0958     -1-j    -2.5552+j0.0958
        3   0.1431+j0.2629   1+j    0.4060+j0.1198     1+j      0.4060+j0.1198
        4   1.4449+j1.2177   1+j    2.6626-j0.2272     -1-j    -2.6626+j0.2272

        k   Z_hk             R_2k   Z_hk R*_2k         R_3k    Z_hk R*_3k
h = 1   1   1.3371+j1.4569   1+j     2.7940+j0.1198    1+j      2.7940+j0.1198
        2   0.2509+j0.2625   1+j     0.5134+j0.0116    -1-j    -0.5134-j0.0116
        3   1.4569+j1.3371   -1-j   -2.7940+j0.1198    -1-j    -2.7940+j0.1198
        4   0.2625+j0.2509   -1-j   -0.5134+j0.0116    1+j      0.5134-j0.0116
h = 2   1   0.2509+j0.2625   1+j     0.5134+j0.0116    1+j      0.5134+j0.0116
        2   1.3255+j1.2297   1+j     2.5552-j0.0958    -1-j    -2.5552+j0.0958
        3   0.1431+j0.2629   -1-j   -0.4060-j0.1198    -1-j    -0.4060-j0.1198
        4   1.4449+j1.2177   -1-j   -2.6626+j0.2272    1+j      2.6626-j0.2272

Fig. 11. Simplified impulse response of the transmit filter. $T_c$ = chip interval.

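From the values obtained in Table VIII, we compute the
code-domain power coefficients as follows:

$$\sum_{k=1}^{4} |R_{ik}|^2 \; \sum_{h=1}^{2} \sum_{k=1}^{4} |Z_{hk}|^2 = 121.1648 \qquad (41)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{0k} \right|^2 = |6.6148|^2 + |6.1372 - j0.1916|^2 = 81.4575 \qquad (42)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{1k} \right|^2 = |4.5612|^2 + |-4.2984 + j0.4544|^2 = 39.4873 \qquad (43)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{2k} \right|^2 = |j0.2628|^2 + |j0.0232|^2 = 0.0696 \qquad (44)$$

$$\sum_{h=1}^{2} \left| \sum_{k=1}^{4} Z_{hk} R^*_{3k} \right|^2 = |j0.2164|^2 + |0.2148 - j0.2396|^2 = 0.1504 \qquad (45)$$

$$\rho_0 = \frac{81.4575}{121.1648} = 0.6723 \qquad (46)$$

$$\rho_1 = \frac{39.4873}{121.1648} = 0.3259 \qquad (47)$$

$$\rho_2 = \frac{0.0696}{121.1648} = 0.0006 \qquad (48)$$

$$\rho_3 = \frac{0.1504}{121.1648} = 0.0012 \qquad (49)$$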
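The complete example can be regenerated from its ingredients. The Python sketch below is an illustration, not the instrument's code. It assumes the PN and Walsh chip values implied by Tables VI and VII, a linear interpolator for the 0.1-chip timing offset, and a periodic wrap at the end of the eight-chip record (an assumption that happens to match the tabulated values at t8).

```python
import numpy as np

# Chip values implied by the example (signs of the pilot components).
i_pn = np.array([-1, 1, -1, 1, -1, 1, 1, -1])
q_pn = np.array([-1, 1, 1, -1, -1, -1, -1, 1])
walsh = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]])

a0, a1 = 0.8, 0.6
d1 = np.repeat([1, -1], 4)                # channel-1 data bit in each Walsh interval
w1 = np.tile(walsh[1], 2)                 # Walsh function 1 repeated over both intervals

# Ideal chips of each channel.
I0, Q0 = a0 * i_pn, a0 * q_pn
I1, Q1 = a1 * d1 * w1 * i_pn, a1 * d1 * w1 * q_pn

# Timing offset of 0.1 chip: 90% current value + 10% next value (periodic wrap assumed).
frac = 0.1
I1t = (1 - frac) * I1 + frac * np.roll(I1, -1)
Q1t = (1 - frac) * Q1 + frac * np.roll(Q1, -1)

# Phase offset of 0.1 radian applied to channel 1.
th = 0.1
I1p = I1t * np.cos(th) - Q1t * np.sin(th)
Q1p = I1t * np.sin(th) + Q1t * np.cos(th)

# Composite signal, despread by the PN sequences, arranged as Z[h, k].
Z = ((I0 + I1p) * i_pn + 1j * (Q0 + Q1p) * q_pn).reshape(2, 4)

# Code-domain power coefficients (equation 32).
R = walsh * (1 + 1j)
proj = Z @ R.conj().T
rho = np.sum(np.abs(proj) ** 2, axis=0) / (
    np.sum(np.abs(R) ** 2, axis=1) * np.sum(np.abs(Z) ** 2))
print(np.round(rho, 4))
```

Computed at full precision, the coefficients agree with equations 46 through 49 to about four decimal places and sum to unity.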



We note that the timing and phase errors caused some of the
energy from code channel 1 to leak into the other code
channels. However, the coefficients again sum to unity
(equation 50). This condition is always satisfied regardless of
the errors introduced to the data sequence $Z = Z_I + jZ_Q$.

Estimates of Time and Phase Offsets. We saw in the above
example that when code channel 1 was offset in time and
phase relative to the pilot channel, errors were introduced
that caused the relative energy to increase in code channels
0, 2, and 3 and to decrease in channel 1. To determine the
values of the offset errors, the mean squared difference
between the observable data, Z, and an ideal reference signal,
R, is minimized. For the example considered above, the
errors introduced by timing and phase offsets are equal to
the difference in $Z_I + jZ_Q$ for the case of no errors given in
Table VII and the case with phase and time offset errors
given in Table VIII. These errors as a function of time $t_k$ are
listed in Table IX.



Using the listed values, the mean squared error is:

$$\mathrm{MSE} = \frac{1}{8} \sum_{k=1}^{8} \left| E_{Ik} + jE_{Qk} \right|^2 = 0.13876. \qquad (51)$$

To estimate timing and phase offset errors, the active code
channels are determined by calculating $\rho_i$ for every i and
identifying the channels for which the values of $\rho_i$ are above
a preset threshold. For example, if a threshold of 0.01
(corresponding to -20 dB) is used, every channel for which $\rho_i >$
0.01 will be declared an active channel.

In addition to determining the active code channels, it is
necessary to determine the data sequence $d_{ih}$ for each active
channel, in which the subscript i denotes the ith code channel
and the subscript h denotes the hth Walsh function interval
in the measurement interval. The data detector incorporated
into the function used to calculate $\rho_i$ is:

$$d_{ih} = \mathrm{sgn}\left\{ \Re\left[ \sum_k Z_{hk} R^*_{ik} \right] \right\} \qquad (52)$$

where

$$\mathrm{sgn}(u) = \begin{cases} 1, & u > 0 \\ -1, & u < 0 \end{cases} \qquad (53)$$

and $\Re[z]$ is the real part of z. The index k varies over the
chips in a Walsh function interval (k = 0 to 3 in our example).
From the values tabulated in Table VIII, we can generate the
detected data as shown in Table X.
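The active-channel test and the data detector can be sketched in a few lines of Python. This is an illustration using the despread samples of Table VIII, not the instrument's code. The tie-breaking rule at u = 0 (treated here as +1) is an assumption, since equation 53 leaves that case undefined.

```python
import numpy as np

# Despread samples Z_hk from Table VIII (rows: Walsh intervals h = 1, 2).
Z = np.array([[1.3371 + 1.4569j, 0.2509 + 0.2625j, 1.4569 + 1.3371j, 0.2625 + 0.2509j],
              [0.2509 + 0.2625j, 1.3255 + 1.2297j, 0.1431 + 0.2629j, 1.4449 + 1.2177j]])
walsh = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]])
R = walsh * (1 + 1j)

proj = Z @ R.conj().T                      # proj[h, i] = sum_k Z_hk R*_ik
rho = np.sum(np.abs(proj) ** 2, axis=0) / (
    np.sum(np.abs(R) ** 2, axis=1) * np.sum(np.abs(Z) ** 2))

threshold = 0.01                           # -20 dB, as in the text
active = rho > threshold

# Data detector of equation 52; u = 0 is mapped to +1 (assumption).
d = np.where(proj.real >= 0, 1, -1)

for i in np.flatnonzero(active):
    print(f"channel {i}: rho = {rho[i]:.4f}, data = {d[:, i].tolist()}")
```

Channels 0 and 1 are declared active, and the detected data for channel 1 change sign between the two Walsh function intervals, as in Table X.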



$$\rho_0 + \rho_1 + \rho_2 + \rho_3 = 1.0000. \qquad (50)$$

Table IX
In-Phase ($E_I$) and Quadrature ($E_Q$) Components of Errors
in Example for Timing and Phase Offset Errors

Time   t1        t2        t3        t4        t5        t6        t7        t8
E_I    -0.0629    0.0509    0.0569    0.0625    0.0509   -0.0745   -0.0569    0.0449
E_Q     0.0569    0.0625   -0.0629    0.0509    0.0625   -0.1703    0.0629   -0.1823



Table X
Calculations for Data Detection in the Example

i, h    rho_i               sum_k Z_hk R*_ik      d_ih
0, 1    0.6723 (active)      6.6148                1
0, 2                         6.1372 - j0.1916      1
1, 1    0.3259 (active)      4.5612                1
1, 2                        -4.2984 + j0.4544     -1
2, 1    0.0006 (inactive)
2, 2
3, 1    0.0012 (inactive)
3, 2



After the active code channels and their data sequences are
determined, an ideal signal of the form of equations 9 and 10
can be generated for each active code channel. The in-phase
and quadrature components of the ideal signals are:

$$I_i(t) = A_i(t) \cos \Phi_i(t) \qquad (54)$$

and

$$Q_i(t) = A_i(t) \sin \Phi_i(t) \qquad (55)$$



where $A_i(t)$ and $\Phi_i(t)$ are the amplitude and phase of the
ideal signal of the ith code channel passing through the
points (±1, ±1) in the I-Q diagram as shown in Fig. 6. The
reference signal is generated by superimposing the ideal
signals given by equations 54 and 55 for each active code
channel. The resulting in-phase and quadrature components
of the ideal reference signal are:

$$I_{ref}(t) = \sum_i a_i A_i(t - \tau_i) \cos\left[ \Delta\omega t + \Phi_i(t - \tau_i) + \theta_i \right] \qquad (56)$$

and

$$Q_{ref}(t) = \sum_i a_i A_i(t - \tau_i) \sin\left[ \Delta\omega t + \Phi_i(t - \tau_i) + \theta_i \right] \qquad (57)$$

where $\Delta\omega$ is the frequency error, $a_i$ is the relative amplitude
($a_i = \sqrt{\rho_i}$), $\tau_i$ is the time delay, and $\theta_i$ is the phase of the ith
code channel. The summations are over the set of active
code channels.

The frequency error, time delays, and phases are determined
by finding values of $\Delta\hat{\omega}$, $\hat{a}_i$, $\hat{\tau}_i$, and $\hat{\theta}_i$ for all values of i
corresponding to the active code channels to minimize the
mean squared difference between the observable sequence
$Z(t_k) = Z_I(t_k) + jZ_Q(t_k)$ and the reference $R(t_k) = I_{ref}(t_k) +
jQ_{ref}(t_k)$, which is:

$$\frac{1}{MN} \sum_{k=1}^{MN} \left| Z(t_k) - R(t_k) \right|^2 \qquad (58)$$

where M and N are the same as in equation 32. $\hat{\tau}_0$ and $\Delta\hat{\omega}$
are used to update previous estimates of time delay and
frequency. Estimates of time and phase offsets obtained from
$\hat{\tau}_i$ and $\hat{\theta}_i$ are:

$$\Delta\hat{\tau}_i = \hat{\tau}_i - \hat{\tau}_0$$

and

$$\Delta\hat{\theta}_i = \hat{\theta}_i - \hat{\theta}_0. \qquad (59)$$

For the example above, values of $\Delta\hat{\omega}$, $\hat{a}_i$, $\hat{\tau}_i$, and $\hat{\theta}_i$ would
be found to produce zero mean squared difference and
error-free estimates of these parameters. In general, however,
errors other than those introduced by timing and phase
offsets would be present, so that after the minimization of the
mean squared difference, a nonzero residual between the
reference and the observable would exist and the parameters
would be estimated with some error.
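The claim that the example yields zero mean squared difference at the true offsets can be illustrated with a brute-force search. The Python sketch below is not the search procedure used in the instrument. It assumes the chip values of the example, the linear-interpolator transmit filter with periodic wrap, a known channel-1 amplitude of 0.6, and a coarse grid over the timing offset (in chips) and phase offset (in radians). The function name `despread_model` is hypothetical.

```python
import numpy as np

i_pn = np.array([-1, 1, -1, 1, -1, 1, 1, -1])
q_pn = np.array([-1, 1, 1, -1, -1, -1, -1, 1])
w1 = np.tile([1, -1, 1, -1], 2)           # Walsh function 1 over two intervals
d1 = np.repeat([1, -1], 4)                # detected channel-1 data

def despread_model(a1, dtau, dtheta, a0=0.8):
    """Despread reference for pilot plus channel 1 with the given offsets."""
    chips = a1 * d1 * w1
    ch1 = chips * i_pn + 1j * chips * q_pn
    ch1 = (1 - dtau) * ch1 + dtau * np.roll(ch1, -1)   # linear interpolation
    ch1 = ch1 * np.exp(1j * dtheta)                    # phase offset
    total = a0 * (i_pn + 1j * q_pn) + ch1
    return total.real * i_pn + 1j * total.imag * q_pn

# "Observed" data: the example with offsets of 0.1 chip and 0.1 radian.
z_obs = despread_model(0.6, 0.1, 0.1)

taus = np.linspace(0.0, 0.2, 21)
thetas = np.linspace(0.0, 0.2, 21)
mse = np.array([[np.mean(np.abs(z_obs - despread_model(0.6, tau, th)) ** 2)
                 for th in thetas] for tau in taus])

it, ith = np.unravel_index(np.argmin(mse), mse.shape)
tau_hat, theta_hat = taus[it], thetas[ith]
print(tau_hat, theta_hat, mse.min())
```

A practical estimator would also search over amplitude and frequency error and would use a finer or gradient-based search, but the grid is enough to show that the minimum of the mean squared difference falls at the true offsets.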

Signal Flow Diagram 

The signal flow diagram for the CDMA power, timing, and
phase offset measurement algorithms is shown in Fig. 12.
The signal under test from the base station transmitter is
down-converted to a 3.6864-MHz IF signal that is sampled at
4.9152 MSa/s. The digitized IF signal is passed through a
finite-impulse-response (FIR), linear-phase, digital IF filter
centered at 1.2288 MHz. This filter has a flat passband YA
MHz wide, which is considerably wider than the 1.23-MHz
bandwidth of the IF signal and provides blocking at dc and
359.2 kHz. Indeed, the primary purpose of the IF filter is to
block these signal components.

Following the IF filter, the signal is down-converted to
in-phase (I) and quadrature (Q) baseband signals. In the
down-converter, the I and Q signals are filtered by flat, FIR,
linear-phase, low-pass filters with passbands from 0 to 700 kHz
and stop bands from 1.16 to 2.0 MHz. The full
sample rate of 4.9152 MSa/s is retained at the output of the
down-converter to provide maximum accuracy at the
correlator.

The next function after the down-converter is the correlator,
which provides an estimate of the timing of the signal under
test. The inputs to the correlator are the baseband signal
from the down-converter and an internally generated reference
signal. This reference signal is the mathematically
ideal signal that would be present at the output of the
down-converter if only the pilot signal were transmitted. The time
origin of the reference signal corresponds to the first binary
1 following 15 binary 0s of the pseudonoise sequences $i_{pn}$
and $q_{pn}$, as specified in the IS-95 standard.

Fig. 12. Signal flow diagram for
the HP 83203B CDMA power, timing,
and phase offset measurement
algorithms.



The correlator performs the timing acquisition described
earlier by finding the value of $\tau_R$ that maximizes the function
given by expression 25. Since this function is sensitive to
frequency error, the correlator works reliably over a limited
range of frequency. If T is the length of the record (in seconds)
used in the correlator, then the maximum frequency
error for which the correlator will provide reliable acquisition
is:

$$\Delta f_{max} = \frac{1}{2T}. \qquad (60)$$

For example, if a 1.25-ms time record is used, then the maximum
frequency error that will allow reliable acquisition in
time is $\pm\Delta f_{max} = \pm 400$ Hz.

After the time delay $\hat{\tau}_0$ is determined, the baseband signal is
time-aligned with the reference signal. This function is performed
in the synchronizer, which consists of a pair (for I
and Q) of low-pass filters that resample the signals at a rate
of 2.4576 MSa/s with a variable time delay to introduce the
appropriate timing.

The synchronized baseband and reference signals are used
in the frequency and phase preestimator to obtain initial
estimates of the carrier frequency and phase as given by
equations 30 and 31. These estimates are then used in the
frequency and phase compensator to largely remove $\Delta\omega$ and
$\theta_0$ from the baseband signals.

After obtaining a baseband signal that is compensated in
frequency and phase, the next step is to remove the intersymbol
interference introduced by the transmit filter. This
step is necessary to ensure the orthogonality of the code
channels to allow calculation of the code-domain power
coefficients by the algorithm discussed earlier. Intersymbol
interference is removed by the complementary filter, which
when cascaded with the transmit filter produces an overall
filter response that satisfies Nyquist's criterion for zero
intersymbol interference.

After the intersymbol interference is removed from the
baseband signal by the complementary filter, refined estimates
of the carrier frequency and phase are obtained by
minimizing the mean squared difference between the baseband
signal and a reference signal consisting of only the
pilot. The procedure used here is similar to that used for
estimating the frequency and phase in conjunction with the
time and phase offsets as described earlier. After the
intersymbol interference has been removed, it is unnecessary to
include the effect of the transmit filters; this allows the pilot
sequences to be used directly as the reference signals.

After the refined estimates of carrier frequency and phase
are obtained, the baseband signal is again passed through a
compensator and a complementary filter to improve the
removal of frequency error, phase error, and intersymbol
interference from the baseband signal.

Following this second stage of compensation, the baseband
signal is ready to be used for calculating $\rho_i$ as described
earlier. This function is performed in the $\rho_i$ calculator shown in
the signal flow diagram. The data bits needed to calculate the
reference signal used for estimating time and phase offsets of
code channels, as described earlier, are also detected in this
function. This function could also be used to calculate
the waveform quality factor $\rho$. However, this parameter
is actually calculated by another function developed for
the HP 83203A using the procedure given in an earlier
section.

The final steps in the signal flow diagram involve determining
the time offsets and phase offsets of the active code
channels relative to the pilot channel. To estimate these offset
parameters, it is necessary to generate an ideal reference
signal corresponding to the active code channels in which
the amplitudes, phases, time delays, and frequencies of all of
the code channels in the reference signal can be controlled.
The function that generates this ideal reference signal,
referred to as the reference signal synthesizer, is invoked by
the parameter estimator, which uses a search procedure to
minimize the mean squared difference between the baseband
test signal and the synthesized reference signal as
described earlier.

Accuracy of the Measurement Equipment 

Specifications for the HP 83203B (HP 8921A/600) are warranted
performance. These specifications are derived from
the accuracy of the measurement algorithms, environmental
considerations, measurement uncertainties, unit-to-unit
variations, and customer specification margins. Typical
performance of the HP 83203B is significantly better than the
published specifications.

The minimum performance of a base station transmitter is
specified in the IS-97 standard. In section 11.1.3 of this
standard, Table 11.1.3-1, reproduced here as Table XI, specifies
the frequency tolerance, time reference, pilot waveform
quality, and RF power output variation.



Table XI
Environmental Test Limits
(from Table 11.1.3-1 in the IS-97 Standard)

Parameter                     Limit
Frequency Tolerance           ±0.05 ppm
Time Reference                ±10 µs
Pilot Waveform Quality        $\rho$ > 0.912
RF Power Output Variation     +2 dB, -4 dB



The carrier frequency of the RF signal to be tested is
approximately 900 MHz, so the frequency tolerance given
above corresponds to an absolute frequency tolerance of
±45 Hz. Since the HP 83203B can acquire a signal and accurately
estimate the frequency error when the frequency error
is as large as ±400 Hz for a 1.25-ms measurement interval,
frequency errors within the above tolerance are easily
accommodated.
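The two figures quoted here follow from simple arithmetic, sketched below in Python. The function names are hypothetical, and the acquisition limit uses the relation $\Delta f_{max} = 1/(2T)$ of equation 60.

```python
def ppm_to_hz(ppm, carrier_hz):
    """Convert a fractional tolerance in parts per million to hertz."""
    return ppm * 1e-6 * carrier_hz

def max_acquisition_error_hz(record_length_s):
    """Largest frequency error for reliable timing acquisition, 1/(2T)."""
    return 1.0 / (2.0 * record_length_s)

tolerance_hz = ppm_to_hz(0.05, 900e6)             # about 45 Hz at 900 MHz
acquisition_hz = max_acquisition_error_hz(1.25e-3)  # about 400 Hz for a 1.25-ms record

print(tolerance_hz, acquisition_hz, tolerance_hz < acquisition_hz)
```

The transmitter tolerance is roughly one ninth of the acquisition range, which is why frequency errors within the IS-97 limit do not threaten acquisition.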

The tolerance on pilot waveform quality significantly impacts
the accuracy of the measurement algorithms.
Error-vector-magnitude-squared ($evm^2$), which is defined as the
ratio of the energy of the error to the energy of the error-free
transmit signal, can be shown to be approximately related to
the waveform quality factor, $\rho$, as:

$$evm^2 \approx \frac{1}{\rho} - 1. \qquad (61)$$

For the value of $\rho$ = 0.912 in Table XI,

$$evm \approx \sqrt{\frac{1}{0.912} - 1} = 0.31 \qquad (62)$$

that is, the waveform quality specified in Table XI corresponds
to a signal with an rms error of approximately 31%.
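The relation between $\rho$ and the error vector magnitude can be checked numerically. The Python sketch below is an illustration under a specific assumption: the error vector is chosen to be exactly orthogonal to the reference, in which case the relation of equation 61 holds exactly rather than approximately. The helper names are hypothetical.

```python
import numpy as np

def waveform_quality(z, r):
    """Ratio of equation 40 for measured samples z and reference r."""
    num = np.abs(np.vdot(r, z)) ** 2
    den = np.vdot(r, r).real * np.vdot(z, z).real
    return num / den

def evm_from_rho(rho):
    """Error vector magnitude implied by equation 61."""
    return np.sqrt(1.0 / rho - 1.0)

r = np.full(4, 1 + 1j)                            # error-free reference
e = 0.31 * np.array([1, -1, 1, -1]) * (1 + 1j)    # error orthogonal to r, 31% rms
rho = waveform_quality(r + e, r)

print(round(float(rho), 4), round(float(evm_from_rho(rho)), 4))
print(round(float(evm_from_rho(0.912)), 4))
```

An error whose energy is 0.31 squared times the signal energy gives a waveform quality close to the 0.912 limit of Table XI, and inverting the relation recovers the 31% figure.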

Other errors that impact the accuracy of the measurement
equipment are time errors and phase differences between
the pilot channel and other code channels. Tolerances on
these errors are given in sections 10.3.1.2.3 and 10.3.1.3.3 of
the IS-97 standard as less than ±50 ns for time errors and
less than ±50 mrad for the phase differences.

The accuracy of the waveform quality measurement equipment
is specified in Table 12.4.2.1-1 of the IS-97 standard,
repeated here as Table XII.

Waveform quality is measured when only the pilot is transmitted.
We will discuss the accuracy in measuring each of
the parameters listed above and the measurement interval
necessary to achieve the performance specified.

To measure code-domain power, test models for the base
station are specified in Table 12.5.2-1 of the IS-97 standard,
reproduced here as Table XIII.

Table XII
Accuracy of Waveform Quality Measurement Equipment
(from Table 12.4.2.1-1 in the IS-97 Standard)

Parameter                               Symbol      Accuracy Requirement
Waveform Quality                        $\rho$      ±5 x 10^-4 from 0.9 to 1.0
Frequency Error (exclusive of test      $\Delta f$  ±10 Hz
  equipment time-base errors)
Pilot Time Alignment                    $\tau_0$    ±135 ns

Table XIII
Base Station Test Model, Nominal
(from Table 12.5.2-1 in the IS-97 Standard)

           Number of   Fraction of      Fraction of
Type       Channels    Power (linear)   Power (dB)    Comments
Pilot      1           0.2000           -7.0          Code channel 0
Sync       1           0.0471           -13.3         Code channel 32, always 1/8-rate
Paging     1           0.1882           -7.3          Code channel 1, full-rate only
Traffic    6           0.09412          -10.3         Variable code channel assignments;
                                                      full-rate only



The measurement algorithms have been tested and found to
provide accurate results for signals with less than 10% of the
power in the pilot channel; however, in discussing the accuracy
of the measurement algorithms in the next subsection,
we will only consider performance under the conditions
prescribed by the nominal test model.

The accuracy required of the code-domain measurement
equipment is given in Table 12.4.2.2-1 of the IS-97 standard
using the nominal test model given above. This table is
reproduced here as Table XIV.

We will discuss the accuracy of measuring each of the
parameters given in Table XIV and give the minimum measurement
intervals and number of subestimates that must be
averaged to achieve the accuracies specified.

Accuracy of the Measurement Algorithms
Dynamic Range. The flatness of the filters and the numerical
accuracy of the computations used in all of the signal processing
algorithms for the HP 83203B are closely maintained
to produce a computational error level of approximately
-55 dB. Since this error level is typically less than the level
of the spurious signals and quantization noise introduced by
the analog down-conversion process and the analog-to-digital
converter (ADC) used to digitize the IF signal under test, the
dynamic range of the HP 83203B is limited by the noise and
spurious signal level at the output of the ADC. The ADC
uses autoranging to maintain the signal level at the input of
the quantizer at -1 dB to -10 dB from saturation. With the
ADC operating 10 dB below saturation, the noise and
spurious signal level at the output of the ADC is approximately
-45 dB relative to the digitized IF signal. Therefore,
the analog and ADC hardware places a limit on the dynamic
range of the code-domain power measurements of approximately
45 dB.

Table XIV
Accuracy of Code-Domain Measurement Equipment
(from Table 12.4.2.2-1 in the IS-97 Standard)

Parameter                               Symbol            Accuracy Requirement
Code-domain power coefficients          $\rho_i$          ±5 x 10^-4 from 5 x 10^-4 to 1.0
Frequency Error (exclusive of test      $\Delta f$        ±10 Hz
  equipment time-base errors)
Code-domain time offset relative        $\Delta\tau_i$    ±10 ns
  to pilot
Code-domain phase offset relative       $\Delta\theta_i$  ±0.01 radian
  to pilot

Accuracy in Measuring $\rho$ and $\rho_i$. The accuracy in measuring
waveform quality $\rho$ and code-domain power $\rho_i$ depends on
the accuracy of estimating time delay $\tau_0$ and frequency error
$\Delta\omega$. The errors in the measurement of $\rho$ produced by errors
in estimating $\tau_0$ and $\Delta\omega$ are shown in Figs. 13a and 13b for
measurement intervals of 1.04 ms and 2.08 ms. The error
curves correspond to transmitting an ideal pilot channel for
which the true value of $\rho$ is 1.0. Since the percentage error
in the measurement of $\rho$ caused by frequency and timing
errors is independent of the true value of $\rho$, the error curves
presented here apply to values of $\rho$ from $\rho$ = 1.0 to $\rho$ = 0.1.
From Table XII, we see that the required measurement accuracy
specified in the IS-97 standard is ±5 x 10^-4 for $\rho$ = 0.9 to
1.0. This tolerance corresponds to a measurement error of
-33 dB for $\rho$ = 1.0, which is shown in Figs. 13a and 13b.

According to Table XII, frequency error must be measured
to an accuracy of ±10 Hz and pilot time alignment must be
measured to an accuracy of ±135 ns. The uncertainty in the
time reference of the ADC and errors of the time-delay estimator
contribute to the measurement errors of pilot time
delay. In the HP 83203B, the ADC will contribute less than
±125 ns error and the time-delay estimator will contribute
less than ±10 ns error to the pilot time alignment measurement.
Therefore, for purposes of determining the accuracies
in measuring $\rho$ and $\rho_i$, we can assume that limits on the
errors of the measurements of $\tau_0$ and $\Delta\omega$ are:

$$-10 \text{ ns} < \hat{\tau}_0 - \tau_0 < 10 \text{ ns}$$

and

$$-10 \text{ Hz} < \Delta\hat{\omega} - \Delta\omega < 10 \text{ Hz}. \qquad (63)$$



From the error curves in Fig. 13, we see that if the tolerances
given by equation 63 are achieved, then for a measurement
interval of 1.04 ms, the accuracy requirement for measuring
$\rho$ is achieved. If a measurement interval of 2.08 ms is
used, then a timing error of < 10 ns is satisfactory. However,
for the longer measurement interval it is necessary to reduce
the tolerance of the frequency error to ±6 Hz. We can
effectively get a longer measurement interval and avoid the
tighter tolerance on frequency error by averaging several
measurements, as considered later.

Fig. 13. Errors in the measurement of signal quality produced by
errors in estimating (a) $\tau_0$ and (b) $\Delta\omega$ for measurement intervals of
1.04 ms and 2.08 ms. The error curves correspond to transmitting
an ideal pilot channel for which the true value of $\rho$ is 1.0 and are
valid for $\rho$ = 0.1 to $\rho$ = 1.0.
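The frequency tolerances quoted above can be approximated with a simple model. The Python sketch below is not the computation behind Fig. 13. It assumes an ideal constant-envelope pilot whose only impairment is an uncompensated frequency error, for which coherent correlation over a record of length T gives approximately $\rho = \mathrm{sinc}^2(\Delta f\,T)$. The function names and the 1-mHz search grid are hypothetical choices.

```python
import numpy as np

def rho_with_frequency_error(df_hz, T_s):
    """Approximate rho of an ideal pilot with frequency error df over T seconds."""
    return np.sinc(df_hz * T_s) ** 2          # np.sinc(x) = sin(pi x) / (pi x)

def max_frequency_error(T_s, tol=5e-4):
    """Largest frequency error (Hz) that keeps 1 - rho within the tolerance."""
    df = np.linspace(0.0, 50.0, 50001)        # 1-mHz grid, 0 to 50 Hz
    ok = 1.0 - rho_with_frequency_error(df, T_s) <= tol
    return df[ok].max()

f_short = max_frequency_error(1.04e-3)
f_long = max_frequency_error(2.08e-3)
print(round(f_short, 2), round(f_long, 2))
```

Under this model the 5 x 10^-4 accuracy requirement tolerates a little under 12 Hz for the 1.04-ms interval and a little under 6 Hz for the 2.08-ms interval, which is consistent with the ±10 Hz and ±6 Hz figures in the text.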

The errors caused in the measurement of $\rho_0$ by errors in
estimating $\tau_0$ and $\Delta\omega$ are presented in Figs. 14a and 14b. The
error curves correspond to transmitting an ideal pilot in
which the true value of $\rho_0$ is 1.0. This is the same as the
signal model used for the curves in Fig. 13. We see that the
errors caused by timing and frequency errors are relatively
insensitive to the measurement interval when measuring
code-domain power. The reason for this is the difference in
the lengths of the correlators used for the code-domain
power and waveform quality calculations. For code-domain
power, correlated energies are computed over subintervals
one Walsh function interval in length and then 20 of these
energy computations are averaged in the case of the 1.04-ms
measurement interval, or 40 are averaged in the case of the
2.08-ms measurement interval. For the waveform quality
calculation, the correlated energy over the entire measurement
interval is computed. Because the length of the correlator
used for $\rho$ is a factor of 20 or 40 greater than the length
used for $\rho_i$, the measurement of $\rho$ is much more sensitive to
uncompensated frequency errors than the measurement
of $\rho_i$.
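The different sensitivities of $\rho$ and $\rho_0$ to frequency error can be demonstrated directly from equations 40 and 32. The Python sketch below is illustrative only. It assumes a chip rate of 1.2288 Mchip/s, Walsh functions of 64 chips, one sample per chip, and an ideal despread pilot whose only impairment is a 10-Hz frequency error.

```python
import numpy as np

chip_rate = 1.2288e6          # chips per second (assumed IS-95 chip rate)
M, N = 64, 20                 # chips per Walsh function, Walsh intervals (about 1.04 ms)
t = np.arange(N * M) / chip_rate

df = 10.0                                          # uncompensated frequency error, Hz
Z = (1 + 1j) * np.exp(2j * np.pi * df * t)         # despread ideal pilot with error
R0 = np.full(M, 1 + 1j)                            # pilot reference over one interval

# Waveform quality (equation 40): one correlation over the whole record.
R_full = np.tile(R0, N)
rho = np.abs(np.vdot(R_full, Z)) ** 2 / (
    np.vdot(R_full, R_full).real * np.vdot(Z, Z).real)

# Code-domain power of channel 0 (equation 32): energies per Walsh interval, summed.
proj = Z.reshape(N, M) @ R0.conj()
rho0 = np.sum(np.abs(proj) ** 2) / (np.vdot(R0, R0).real * np.vdot(Z, Z).real)

print(1.0 - rho, 1.0 - rho0)
```

With these assumptions the error in $\rho$ is a few parts in 10^4, while the error in $\rho_0$ is below one part in 10^6, because each correlation for $\rho_0$ spans only one Walsh function interval.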

From the error curves in Fig. 14, we see that if the tolerances
given by equation 63 are achieved, then the accuracy
requirement for $\rho_0$ given in Table XIV is achieved. Again, as
with $\rho$, the percentage error in measuring $\rho_0$ is independent
of the true value of $\rho_0$.

The curves in Fig. 14 were obtained for $\rho_0$. However, since
all code channel measurements experience essentially the
same sensitivities to timing and frequency errors, these
curves apply to any $\rho_i$, i = 0, 1, ..., 63 within the dynamic
range of the equipment.

Since the dynamic range of the code-domain power
measurement equipment is approximately 45 dB, precise values
of code-domain power, well within the tolerances specified
by the IS-97 standard, can be obtained for ρ_i = 1.0 to ρ_i =
3.2 × 10^-5 if the tolerances on the estimates of timing and
frequency errors are satisfied. To observe code-domain
power to a level of -45 dB, it would be necessary to use a
test signal with a waveform quality factor of ρ > 0.99997,
where the errors are uniformly distributed in power over the
64 code channels.

Fig. 14. Errors caused in the measurement of ρ_0 by errors in
estimating (a) τ_0 and (b) Δω. The error curves correspond to
transmitting an ideal pilot in which the true value of ρ_0 is 1.0
(same signal model as for Fig. 13). The results for ρ_i for i ≠ 0
are essentially the same. (Horizontal axes: frequency error in Hz
and timing error in ns.)

The measurements of ρ and ρ_i may have error components
that are random. Moreover, if a sequence of measurements
is made from independent data records, then the random
errors for the independent records are uncorrelated. To
reduce the random error components added to the
measurements of ρ and ρ_i, averaging of a set of measurements
obtained from the independent records can be performed. To
perform this averaging, it is not appropriate to average the
values obtained for ρ and ρ_i directly, since this would
introduce a bias to the final result. Rather, the energy terms
contained in the numerator and denominator of equation 40 for
ρ and equation 32 for ρ_i are averaged separately, and then
the final values are obtained as the ratios of these averages.
This mode is referred to in the HP 83203B as "Fast
Code-Domain Power with Averaging."
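
The bias can be seen in a small numerical sketch. The Python
example below is illustrative only: the energy values are
hypothetical, the function names are not from the HP 83203B, and the
actual numerator and denominator terms are those of equations 40 and
32, which are not reproduced here.

```python
def ratio_of_averages(numerators, denominators):
    """Average the numerator and denominator energies separately, then divide."""
    return sum(numerators) / sum(denominators)

def average_of_ratios(numerators, denominators):
    """Average the per-record ratios directly (the biased approach)."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)

# Hypothetical energy terms from four independent data records.
correlated_energy = [0.90, 0.98, 0.90, 0.98]   # numerator terms
total_energy = [0.92, 1.08, 1.08, 0.92]        # denominator terms

print("ratio of averages:", ratio_of_averages(correlated_energy, total_energy))
print("average of ratios:", average_of_ratios(correlated_energy, total_energy))
```

Because the reciprocal of a noisy denominator is biased upward,
averaging the per-record ratios gives a larger value here than the
ratio of the averaged energies.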

Accuracy in Measuring Δτ_i and Δθ_i. The performance of the
algorithms for the code-domain parameter estimator was
tested by performing simulations in which Gaussian random
errors were added to the simulated transmitting signals.
A theoretical expression was derived for the standard
deviation of the estimates of phase offsets, Δθ_i, based on the
same mathematical model used for the simulations. It was
found that the results obtained from the simulations agreed
very well with the results obtained from the theoretically
derived equation, with differences of less than 10 percent.
Moreover, it was found that the error in estimating time
offsets, Δτ_i, when measured in nanoseconds, was
approximately one-half the error in measuring phase offsets
measured in milliradians. Since the tolerances on measurement
accuracy given in Table XIV are ±10 nanoseconds for time
offsets and ±10 milliradians for phase offsets, the
measurement interval is governed by the accuracy requirement
for phase offsets. To measure time offsets and phase offsets to
the accuracy specified in the standard, it was found
necessary to average subestimates of these parameters. A
noteworthy outcome of the performance analysis discussed
herein is that the algorithms designed for the code-domain
parameter estimator indeed minimize the sum-square
difference between the actual transmit signal and the estimated
ideal transmit signal, as specified in the IS-97 standard.

The expression derived for the rms error of the estimate of
the phase of a code channel is:

$$\sigma_{\theta} = \frac{1}{2}\sqrt{\frac{\mathrm{evm}^2}{BNT}} \tag{64}$$

where evm² is the square of the effective error-vector
magnitude, which is equal to the ratio of the total energy of
the error to the energy of the code channel signal in question,
B = 615 kHz is the bandwidth of the baseband transmit signal,
T is the measurement interval for one subestimate of the phase,
and N is the number of subestimates averaged to obtain the
estimate of phase.

The worst case occurs for the sync channel, which for the
nominal test model given in Table XIII has 4.71% of the total
transmit energy. If the waveform quality factor for each
active code channel is ρ = 0.912, then the effective evm² for
the sync channel is given approximately as:

$$\mathrm{evm}^2 = \frac{1/\rho - 1}{0.0471} = 2.049 \tag{65}$$
If the measurement interval is T = 2.0 ms (2.2 ms was used
in the simulations) and the number of subestimates
averaged is 34, then the resulting rms error of the estimate of
the phase of the sync channel is:

$$\sigma_{\theta_{\mathrm{sync}}} = \frac{1}{2}\sqrt{\frac{2.049}{(615)(34)(2.0)}} = 3.50\ \text{mrad}. \tag{66}$$

The effective evm² for the pilot channel is:

$$\mathrm{evm}^2 = \frac{1/\rho - 1}{0.2} = 0.4825, \tag{67}$$

from which, for the same conditions as for the sync channel,
we obtain the rms error of the estimate of the phase of the
pilot channel as:

$$\sigma_{\theta_{\mathrm{pilot}}} = \frac{1}{2}\sqrt{\frac{0.4825}{(615)(34)(2.0)}} = 1.70\ \text{mrad}. \tag{68}$$

Since the phase offset of the sync channel is:

$$\Delta\theta_{\mathrm{sync}} = \theta_{\mathrm{sync}} - \theta_{\mathrm{pilot}}, \tag{69}$$

and θ_sync and θ_pilot are uncorrelated,

$$\sigma_{\Delta\theta_{\mathrm{sync}}} = \sqrt{\sigma_{\theta_{\mathrm{sync}}}^2 + \sigma_{\theta_{\mathrm{pilot}}}^2} = \sqrt{3.5^2 + 1.7^2} = 3.89\ \text{mrad}. \tag{70}$$



The estimates of phase are obtained from the sum of 25
subestimates in which the errors in the subestimates are
essentially independent. Therefore, the estimate of phase
offset is well-approximated as a Gaussian random variable.
Using the Gaussian approximation, the 99% confidence interval
for the estimate of the phase offset of the sync channel for
the nominal test model is:

$$\text{99\% confidence interval} = \pm 2.57\,\sigma_{\Delta\theta_{\mathrm{sync}}} = \pm 10\ \text{mrad}. \tag{71}$$
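
The arithmetic of equations 64 through 71 can be checked with a
short Python sketch. It assumes the forms evm² = (1/ρ - 1)/(fraction
of power in the channel) and σ = (1/2)·sqrt(evm²/(BNT)) used in the
worked example above. The function names are illustrative and are
not part of the HP 83203B software.

```python
import math

BANDWIDTH_HZ = 615e3  # bandwidth B of the baseband transmit signal

def evm_squared(rho, power_fraction):
    """Effective evm^2 for a channel carrying the given fraction of total power."""
    return (1.0 / rho - 1.0) / power_fraction

def phase_rms_error_rad(evm_sq, n_averages, interval_s, bandwidth_hz=BANDWIDTH_HZ):
    """Rms phase error in radians, in the form of equation 64."""
    return 0.5 * math.sqrt(evm_sq / (bandwidth_hz * n_averages * interval_s))

rho = 0.912                          # waveform quality of each active code channel
n_averages, interval_s = 34, 2.0e-3  # N subestimates of T = 2.0 ms each

sigma_sync = phase_rms_error_rad(evm_squared(rho, 0.0471), n_averages, interval_s)
sigma_pilot = phase_rms_error_rad(evm_squared(rho, 0.2), n_averages, interval_s)
sigma_offset = math.hypot(sigma_sync, sigma_pilot)  # uncorrelated errors add in quadrature
ci_99 = 2.57 * sigma_offset                         # Gaussian 99% limit, as in equation 71

print(f"sync  rms error: {sigma_sync * 1e3:.2f} mrad")
print(f"pilot rms error: {sigma_pilot * 1e3:.2f} mrad")
print(f"offset rms error: {sigma_offset * 1e3:.2f} mrad")
print(f"99% interval: +/-{ci_99 * 1e3:.1f} mrad")
```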



The measurement accuracy requirement for Δθ_i given in
Table XIV is an absolute ±10 milliradians. If we interpret this
as the 99% confidence interval, then the accuracy
requirement can be achieved by averaging 34 estimates obtained
using a 2.0-ms measurement interval as demonstrated by the
above example. Other combinations of N and T can be used
to achieve the required accuracy, provided that the value of
T is not too small to allow acquisition of frequency and
timing. It is recommended that a measurement interval of
T ≥ 1.0 ms be used to obtain reliable performance. Other
combinations of N and T that will allow measurement errors
for Δθ_i of less than ±10 mrad are presented in Fig. 15. As
pointed out above, if Δθ_i is measured to the accuracy
required, then the accuracy requirement for Δτ_i will also be
achieved. We wish to emphasize that the accuracy of the
measurements of Δτ_i and Δθ_i depends on the waveform
quality and the percentage of power in the code channel being
measured. The curves in Fig. 15 represent a worst-case
situation in which the waveform quality is ρ = 0.912 for all code
channels and only 4.71% of the transmitter power is
contained in the code channel being measured. For other test
models, the lower bounds on NT can be obtained following
the example given above and, for larger values of ρ, would
be significantly lower than those given in Fig. 15.

Fig. 15. Lower bounds on NT for Δθ_i measurement errors less
than ±10 milliradians for various confidence levels.
(Horizontal axis: number of averages N.)
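
A lower bound on NT for another test model or confidence level can
be sketched by inverting the rms-error expression. The Python
function below is an illustration under stated assumptions: it uses
the equation 64 form for both the measured channel and the pilot,
treats their errors as uncorrelated, and takes the Gaussian
confidence factor from the standard library rather than the rounded
value 2.57. The function and parameter names are hypothetical.

```python
from statistics import NormalDist

def min_nt_seconds(rho, channel_fraction, pilot_fraction, confidence,
                   tolerance_rad=10e-3, bandwidth_hz=615e3):
    """Smallest product N*T (seconds) for which the two-sided Gaussian
    confidence interval of the phase-offset estimate stays within
    +/- tolerance_rad."""
    k = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 2.576 for 99%
    # Sum of evm^2 for the measured channel and the pilot.
    evm_sq_sum = (1.0 / rho - 1.0) * (1.0 / channel_fraction + 1.0 / pilot_fraction)
    return k * k * evm_sq_sum / (4.0 * bandwidth_hz * tolerance_rad ** 2)

for confidence in (0.90, 0.95, 0.99):
    nt = min_nt_seconds(0.912, 0.0471, 0.2, confidence)
    print(f"{confidence:.0%} confidence: NT >= {nt * 1e3:.1f} ms")
```

For the worst-case model at 99% confidence this gives a bound close
to the 34 × 2.0 ms = 68 ms used in the example above.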

Accuracy in Measuring τ_0 and Δω. The accuracy in measuring
ρ and ρ_i is primarily dependent on the accuracy of the
estimates of τ_0 and Δω as shown in Figs. 13 and 14. If τ_0 and Δω
were obtained precisely, then the magnitude of the errors in
the values obtained for ρ and ρ_i would be less than 10^-4,
which is well within the accuracy specified for the
HP 83203B.

The best accuracy for the estimates of τ_0 and Δω is obtained
when the full parameter estimator is employed to estimate
the time and phase offsets of code channels. In this case, τ_i
and θ_i are determined for all active code channels and the
estimate of Δω is obtained jointly with the estimates of τ_i
and θ_i.

The next best accuracy for the estimates of τ_0 and Δω is
obtained by using a reference signal synthesized as the sum
of the reference signals for all active code channels, as is
done for the full parameter estimator, but with the time and
phase offsets set equal to zero in the parameter estimator.
This procedure reduces the search for phase and timing
from a 2K-dimensional problem, where K is the number of
active code channels, to a 2-dimensional problem.
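
The idea of a 2-dimensional search can be illustrated with a
deliberately simplified Python sketch. It is not the estimator used
in the HP 83203B: it assumes a known reference signal, searches only
integer-sample delays and a coarse grid of frequency offsets, and
picks the pair that maximizes the correlation magnitude. The test
signal, grid, and names are all hypothetical.

```python
import cmath
import math
import random

def estimate_delay_and_frequency(received, reference, max_delay,
                                 freq_grid_hz, sample_rate_hz):
    """Return the (delay in samples, frequency offset in Hz) on the grid that
    maximizes the correlation magnitude between received and reference."""
    n = len(reference)
    best_metric, best_delay, best_freq = -1.0, 0, 0.0
    for delay in range(max_delay + 1):
        segment = received[delay:delay + n]
        if len(segment) < n:
            break
        for freq_hz in freq_grid_hz:
            step = 2.0 * math.pi * freq_hz / sample_rate_hz
            # Remove the trial frequency offset, then correlate with the reference.
            corr = sum(z * r.conjugate() * cmath.exp(-1j * step * k)
                       for k, (z, r) in enumerate(zip(segment, reference)))
            if abs(corr) > best_metric:
                best_metric, best_delay, best_freq = abs(corr), delay, freq_hz
    return best_delay, best_freq

# Hypothetical test signal: a random QPSK-like reference, delayed by
# 3 samples and rotated by a 1-kHz frequency offset.
random.seed(1)
SAMPLE_RATE_HZ = 1.2288e6
reference = [complex(random.choice((-1, 1)), random.choice((-1, 1)))
             for _ in range(256)]
true_delay, true_freq_hz = 3, 1000.0
rotated = [r * cmath.exp(2j * math.pi * true_freq_hz * k / SAMPLE_RATE_HZ)
           for k, r in enumerate(reference)]
received = [0j] * true_delay + rotated + [0j] * 8

delay, freq_hz = estimate_delay_and_frequency(
    received, reference, max_delay=8,
    freq_grid_hz=[-2000.0, -1000.0, 0.0, 1000.0, 2000.0],
    sample_rate_hz=SAMPLE_RATE_HZ)
print("estimated delay (samples):", delay, " frequency offset (Hz):", freq_hz)
```

A practical estimator would refine both parameters well below the
grid spacing; the sketch only shows why fixing the per-channel
offsets leaves two unknowns to search.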

The accuracy of the estimates of τ_0 and Δω was determined
through simulations in which the nominal signal model was
used with random time and phase offsets introduced to the
code channels and a measurement interval of 1.09 ms. Timing
and phase offsets that were uniformly distributed over a
range of ±50 ns for time offsets and ±50 mrad for phase
offsets were introduced. The results of these simulations are
presented in Figs. 16 and 17, which show the rms errors of
the estimates of τ_0 and Δω, respectively, as functions of ρ.
From Fig. 16, we see that the estimates of τ_0 obtained from
the 2-dimensional parameter estimator are nearly as
accurate as those obtained from the full 2K-dimensional
parameter estimator. On the other hand, we see from Fig. 17 that
the full parameter estimator provides roughly a factor of
two less error in estimating frequency compared to the
2-dimensional parameter estimator. These curves show that
there is little advantage in using the full parameter estimator
unless time and phase offsets are outputs of the
measurement. Therefore, the second method of obtaining estimates
of τ_0 and Δω is recommended when measuring code-domain
power without measuring time and phase offsets. A mode
in the HP 83203B referred to as "Accurate Code-Domain
Power" employs this second method of obtaining estimates
of τ_0 and Δω.

Fig. 16. Rms error of the estimate of τ_0 as a function of signal
quality ρ, determined through simulations in which the nominal
signal model was used with random time offsets of 0 to ±50 ns and
phase offsets of 0 to ±50 mrad introduced to the code channels
and a measurement interval of 1.09 ms. (Horizontal axis: waveform
quality factor ρ.)

The third method for obtaining estimates of τ_0 and Δω uses a
reference signal consisting of only the pilot signal. This
mode is referred to as "Fast Code-Domain Power" in the HP
83203B. If only the pilot channel is transmitted, then this
mode is as accurate as the other two and is appropriate for
measuring code-domain power. Moreover, if τ_0 and Δω are
known a priori, then the "Fast Code-Domain Power" mode
should be used.

Fig. 17. Rms error of the estimate of Δω as a function of signal
quality ρ, determined through simulations using the same signal
model as for Fig. 16. (Horizontal axis: waveform quality factor ρ.)

Presented in Fig. 18 are curves obtained from simulations
showing the rms error in estimating τ_0 and Δω for the case
in which only the pilot channel is transmitted and a
measurement interval of 1.09 ms is used. Curiously, these curves
show that the timing errors in ns and the frequency errors in
Hz are nearly identical. If we assume that the measurement
errors are Gaussian, then we can obtain the 99% confidence
limits for the measurement of τ_0 and Δω by multiplying the
rms values given in Fig. 18 by a factor of 2.57. To obtain the
measurement error of less than ±10 ns for τ_0 and less than
±10 Hz for Δω as specified in Table XIV with a confidence of
99%, the rms errors in measuring τ_0 and Δω must be less
than 3.9 ns for τ_0 and less than 3.9 Hz for Δω. From Fig. 18,
we see that τ_0 and Δω can be estimated to sufficient
accuracy for 0.85 < ρ < 1.0 using a measurement interval of 1.09 ms.
This exceeds the range of 0.9 < ρ < 1.0 specified in Table XII.

Referring to the performance curves in Figs. 16 and 17, we
see that if ρ is less than approximately 0.97, then the
performance given by these curves may not be adequate. If it is
necessary to obtain better estimates of τ_0 and Δω than those
given in Figs. 16, 17, and 18, then it will be necessary to use
a longer measurement interval than the 1.09 ms considered
here, or to average estimates obtained from independent
time records, as is done for the time and phase offset
measurements. As for the time and phase offset estimates, the
rms errors of the estimates of τ_0 and Δω are proportional to
1/√(NT).
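
The 1/√(NT) scaling gives a quick way to size the averaging. The
Python helper below is a sketch that assumes independent estimates
of equal length, so that the rms error falls as 1/√N. The function
name and the example numbers are illustrative.

```python
import math

def averages_needed(single_rms_error, target_rms_error):
    """Smallest number N of independent estimates whose average has an rms
    error no larger than the target, assuming 1/sqrt(N) scaling."""
    if target_rms_error >= single_rms_error:
        return 1
    return math.ceil((single_rms_error / target_rms_error) ** 2)

# Hypothetical example: a single 1.09-ms estimate with 6.0-ns rms timing
# error, to be brought below the 3.9-ns rms limit quoted in the text.
print(averages_needed(6.0, 3.9))
```

The same ratio applies to lengthening T instead of increasing N,
since only the product NT enters the expression.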

Fig. 18. Curves obtained from simulations showing the rms error in
estimating τ_0 and Δω for the case in which only the pilot channel is
transmitted and a measurement interval of 1.09 ms is used.
(Horizontal axis: waveform quality factor ρ.)

Measurement Examples

Typical results obtained with the HP 8921A cell site test set
using the HP 83203B measurement algorithms are presented
in Figs. 19 and 20. These results are not intended to validate
any particular base station, but are presented only to
illustrate actual measurements obtained using the algorithms
discussed in this paper. The results presented in Fig. 19
were obtained from a base station transmitter in which the
pilot, paging channel 1, sync channel 32, and one full-rate
traffic channel 11 were active. From Fig. 19a, we see that
the floor of the code-domain power is at approximately
-38 dB relative to the total transmitter power, which
corresponds to a relative error energy level of -38 dB + 18 dB =
-20 dB. The factor of 18 dB corresponds to the distribution
of energy to 64 code channels. The floor level of -38 dB
corresponds to a value of ρ approximately equal to:



$$\rho = \frac{1}{1 + 10^{-2.0}} = 0.9901. \tag{72}$$

The value of ρ measured was 0.9882. From the measured
value of ρ we can calculate the approximate value of the
floor level of the code-domain spectrum as:

$$\text{Floor Level} = 10\log_{10}(1/\rho - 1) - 18 = -37.23\ \text{dB}, \tag{73}$$

which agrees closely with the floor level we see in Fig. 19a.
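
Equations 72 and 73 are inverses of one another, which the
following Python sketch makes explicit. It assumes, as the text
does, that the error energy is spread evenly over the 64 code
channels and uses the rounded 18-dB factor. The function names are
illustrative.

```python
import math

SPREAD_DB = 18.0  # 10*log10(64), rounded as in the text

def rho_from_floor(floor_db, spread_db=SPREAD_DB):
    """Waveform quality implied by a code-domain floor level (equation 72 form)."""
    error_ratio = 10.0 ** ((floor_db + spread_db) / 10.0)
    return 1.0 / (1.0 + error_ratio)

def floor_from_rho(rho, spread_db=SPREAD_DB):
    """Code-domain floor level in dB implied by a waveform quality (equation 73 form)."""
    return 10.0 * math.log10(1.0 / rho - 1.0) - spread_db

print(f"rho for a -38 dB floor: {rho_from_floor(-38.0):.4f}")
print(f"floor for rho = 0.9882: {floor_from_rho(0.9882):.2f} dB")
```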

From the plot of code-domain power in Fig. 19a, we see that
code channel 33 is significantly above the floor, even though
code channel 33 was not active. This is an indication that
the active code channels were leaking energy into code
channel 33. It should be pointed out that the base station
was overdriven during this measurement, which could be
seen from a measurement of the spectrum of the
transmitted signal. The plot of the measured spectrum is not
included in this paper.

Measurements of time offsets and phase offsets obtained for
a measurement interval of 1.25 ms are presented in Figs. 19b
and 19c. For these measurements no averaging was used;
therefore, the value of NT to use in equation 64 to determine
the accuracy of the measurement is NT = 1.25 ms. The
channel with the smallest energy level was the sync channel 32,
for which the relative measured energy level was -12.8 dB.
This corresponds to 5.25% of the energy in the sync channel.
By using equation 65 with ρ = 0.9882, we obtain an effective
evm² for the sync channel of:

$$\mathrm{evm}^2 = \frac{1/\rho - 1}{0.0525} = 0.227. \tag{74}$$



Using this value in equation 64, we obtain for the rms error
of the estimate of the phase of the sync channel:

$$\sigma_{\theta_{\mathrm{sync}}} = \frac{1}{2}\sqrt{\frac{0.227}{(615)(1.25)}} = 8.6\ \text{mrad}. \tag{75}$$



The relative power in the pilot channel was -11.41 dB, which
corresponds to 7.23% of the total energy in the pilot. By
following the above procedure for the pilot channel, we obtain
the rms error for the estimate of the phase of the pilot
channel:

$$\sigma_{\theta_{\mathrm{pilot}}} = 7.3\ \text{mrad}. \tag{76}$$



Using the rms errors obtained above in equation 70, we
obtain the rms error in the measurement of the phase offset of
the sync channel:

$$\sigma_{\Delta\theta_{\mathrm{sync}}} = \sqrt{8.6^2 + 7.3^2} = 11.3\ \text{mrad}, \tag{77}$$

and by using the Gaussian assumption used for equation 71
we obtain:

$$\text{99\% confidence interval} = \pm 2.57\,\sigma_{\Delta\theta_{\mathrm{sync}}} = \pm 29\ \text{mrad}. \tag{78}$$

Thus, from the results of the simulations discussed
previously, we can expect a 99% confidence interval for the
measurement of time offset of approximately ±14.5 ns.

Fig. 19. Results of code-domain
measurements of a base station

From Fig. 19b, we see that the measured time offsets are
within the ±50-ns tolerance given in the IS-97 standard, with
the worst-case 17-ns time offset occurring for the paging
channel. The time offset specification is satisfied even if we
include the ±14.5-ns confidence interval. From Fig. 19c, we
see that the phase offsets for the sync channel and the
traffic channel are well within the ±50-mrad tolerance given by
the standard. However, the measured phase offset for the
traffic channel was 91.8 mrad, which is outside the tolerance
specified by the standard.

For the time and phase offset measurements presented here,
the confidence intervals for the measurements were larger
than could be used for valid tests. As discussed in the section
on accuracy above, to obtain acceptable measurement
accuracy it is necessary to average estimates of time and phase
offsets. For the measurement situation of Fig. 19, acceptable
measurement accuracy would have been achieved by
averaging nine estimates to reduce the measurement confidence
intervals by a factor of 3.

Fig. 20. Results of code-domain
measurements of a base station
transmitter with the pilot
(channel 0), paging channel (1), sync
channel (32), and four full-rate
traffic channels (5, 6, 7, 8)
active. (a) Code-domain power
measurements. (b) Time offset
measurements. (c) Phase offset
measurements.

The results of the code-domain measurements of a base
station transmitter in which four full-rate code channels 5, 6,
7, and 8 are active are presented in Fig. 20. In this case, we
see that a significant amount of energy is leaked to inactive
code channels. From Figs. 20b and 20c, we see that the
largest time offset and phase offset are -15.6 ns and 69 mrad,
respectively, for the sync channel. For these results, a single
measurement interval of 1.25 ms was used, which results in
large measurement confidence intervals.

born in Taichung, Taiwan. He is married and likes
outdoor activities such as bicycling, Rollerblading,
and racquet sports.

Richard Dugan is a project
manager at the Optical Communications
Division and
manages fiber-optic and link
IC development. He earned a
BSEE degree in 1982 from
the University of California
at Santa Barbara and an
MSEE degree in 1985 from
Stanford University. He joined HP's Microwave
Systems Division in 1982. He's contributed to microwave
device characterization and has worked as an IC
designer and a design group manager. He is
professionally interested in high-speed ICs and data
communication and has coauthored two papers on mixers and
modulators. Richard was born in Pittsburgh,
Pennsylvania. He has been married for seventeen years and
has three-year-old twins. His hobbies include cycling,
fishing, and cooking.

Born in Hong Kong, Benny
Lai received a BSEE degree
in 1982 and an MSEE degree
in 1983, both from the
University of California at
Berkeley. He joined HP's
Microwave Systems Division
in 1981 and worked on
device modeling and simulation
techniques, microwave amplifier design,
decision circuit design, and the G-link chipset. He then
transferred to the Optical Communications Division
where he is a principal member of the technical staff,
currently responsible for 622-Mbit/s clock data
recovery (CDR) postamplifier design and fibre channel
arbitrated loop IC design. For the fiber channel chipset he
contributed to the design of the transmitter and
receiver architecture and the phase-locked loop. He
also worked on the logic library and array designs. He
is named as an inventor in two patents on CDR
architecture and the G-link encoding scheme. He has two
patents pending on the integrator and the
loss-of-signal detector. He has authored three papers on the
CDR IC and the G-link IC chipset. Benny is married
and enjoys gardening and landscaping.

Born in Honolulu, Hawaii,
Margaret Nakamoto was
awarded a BSEE degree in
1989 from the University of
Hawaii and an MSEE degree
in 1992 from Stanford
University. She joined HP's
Microwave Systems Division
in 1989 and contributed to
microwave IC characterization and GaAs IC design.
Then, as a member of the technical staff at the
Optical Communications Division, she contributed to
G-link IC characterization and test support and worked
on I/O cell design, chip verification and simulation,
channel chipset. Currently she is responsible for fibre
channel arbitrated loop IC design. She has authored
an IEEE paper on GaAs IC design. Margaret is mar-

Joe Gilray is an R&D
engineer at the Integrated
Circuits Business Division in

involved with the support of
high-level design tools for
synthesis and simulation
and the development and
support of ASIC design
methodologies. He joined HP in 1984 at the Logic
Systems Division. He initially worked as a member of
the technical staff building and supporting the CAEE
software that links the HP Design Capture System to
the system GenRad HILO® simulator. During this
time, he authored a paper on the integration of
software and hardware simulation. He was the process
manager for the HDL code inspection process and
often acted as moderator for individual code
inspections. Before coming to HP, he worked at Compion
Systems as a hardware and software system designer.
He is professionally interested in object-oriented
programming in C++. He was awarded a BSEE degree
from the University of Illinois at Champaign-Urbana.
Joe was born in Waukegan, Illinois. He is married
and has a son and daughter. He likes bicycling and

Ray Birgenheier has been a
consultant and development
engineer in digital signal
processing and digital
communications at HP's Spokane
Division since 1981. He
contributed to the standards on
modulation accuracy
prepared by two subcommittees
of the Telecommunications Industry Association and
pioneered the development of techniques and
algorithms for measuring modulation accuracy and
code-domain power of cellular radio transmitters. He is
named as an inventor in two patents on premodulation
filters and two on modulation measurement
techniques and apparatus. He developed the modulation
measurement algorithms for the HP 11847A, HP
83203A, HP 83203B, and HP 8924C measurement
systems, which verify the RF performance of TDMA
and CDMA digital cellular transmitters. He received a
BSEE degree in 1963 from Montana State University,
an MSEE degree in 1965 from the University of
Southern California, and a PhD degree in 1972 in
electrical engineering from the University of
California at Los Angeles. He worked for Hughes Aircraft
Company's Radar Systems Division from 1963 to
1980, where he became a senior scientist in 1976.
Since 1980, he has served as a professor and
chairman of the Department of Electrical Engineering at
Gonzaga University. He is a member of the IEEE
education. A U.S. Navy veteran, Ray was born in
Billings, Montana. He is married and has seven children.